## Transformers:

#### Definition: A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).

###### Source: https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
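
To make "differentially weighting the significance of each part of the input" concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The toy sizes, the random projection matrices, and the helper names `softmax` and `self_attention` are assumptions made for illustration; real transformers use learned weights, multiple heads, and extra layers around this step.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # weights[i, j] = how much position i attends to position j.
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8  # toy sizes, chosen only for illustration
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)           # one output vector per input position
print(weights.sum(axis=-1))   # every row of attention weights sums to 1
```

Each row of `weights` is a probability distribution over the input positions, which is what lets every output element draw on every input element.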

## Bidirectional Encoder Representations from Transformers (BERT) Model: NLP

#### *Definition*: BERT is based on Transformers, a deep learning model in which every output element is connected to every input element, and the weightings between them are dynamically calculated based upon their connection.

##### Google pre-trained it on about 2500M words of English Wikipedia (together with about 800M words from BooksCorpus) using two objectives: 1. Masked Language Modeling (MLM), 2. Next Sentence Prediction (NSP). Google Search also uses BERT to better understand queries.

###### Source: https://www.youtube.com/watch?v=7kLi8u2dJz0&ab_channel=codebasics
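
The first pre-training objective hides some input tokens and asks the model to predict them. The sketch below shows only that masking idea in plain Python. The helper `mask_tokens`, the 15% rate as a default argument, and the fixed seed are assumptions for illustration; the original BERT recipe is more involved (for example, some selected tokens are replaced by a random token or left unchanged instead of being replaced by `[MASK]`).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels keep the original token at masked positions."""
    rng = random.Random(seed)
    # Special tokens are never masked.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    n_to_mask = max(1, round(len(candidates) * mask_prob))
    chosen = set(rng.sample(candidates, n_to_mask))
    masked = [("[MASK]" if i in chosen else t) for i, t in enumerate(tokens)]
    labels = [(t if i in chosen else None) for i, t in enumerate(tokens)]
    return masked, labels

tokens = ["[CLS]", "i", "love", "python", "programming", "[SEP]"]
masked, labels = mask_tokens(tokens)
print(masked)   # input the model would see
print(labels)   # what it is asked to predict (None = not scored)
```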

In [3]:
# Importing the desired/required tensorflow libraries:

import tensorflow_hub as hub
import tensorflow_text as text

In [4]:
# Getting the useful models from the tensorflow_hub repo:

pre_processor_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

In [5]:
# Wraps a SavedModel (or a legacy TF1 Hub format) as a Keras Layer:

bert_pp_model = hub.KerasLayer(pre_processor_url)

In [6]:
# Here, we store two test sentences in the text_test variable, pass them through the
# preprocessing model, and look at the keys of the dict it returns:

text_test = ['nice_movie_indeed', 'I love Python Programming']
text_pre_processed = bert_pp_model(text_test)
text_pre_processed.keys()

dict_keys(['input_word_ids', 'input_type_ids', 'input_mask'])

In [7]:
# Inspect the token IDs the preprocessor produced (the other two outputs are commented out):

# text_pre_processed['input_mask']
# text_pre_processed['input_type_ids']
text_pre_processed['input_word_ids']

# BERT always puts a special token called CLS (ID 101) at the start of the input
# and a separator token called SEP (ID 102) at the end.
# Each input is padded with zeros to a fixed sequence length of 128 tokens.
# CLS nice_movie_indeed SEP

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[  101,  3835,  1035,  3185,  1035,  5262,   102,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0, 
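
The layout of the three preprocessor outputs can be reproduced by hand for a single sentence. This is a rough sketch, not the preprocessor's actual implementation: `pack_inputs` is a hypothetical helper, and the word-piece IDs are simply copied from the tensor printed above rather than produced by a real tokenizer.

```python
def pack_inputs(token_ids, seq_length=128, cls_id=101, sep_id=102, pad_id=0):
    """Hypothetical helper mimicking the preprocessor's output layout for one sentence."""
    # Leave room for CLS and SEP, then wrap the tokens with them.
    token_ids = list(token_ids)[: seq_length - 2]
    ids = [cls_id] + token_ids + [sep_id]
    n_real = len(ids)
    n_pad = seq_length - n_real
    return {
        "input_word_ids": ids + [pad_id] * n_pad,
        "input_mask": [1] * n_real + [0] * n_pad,   # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,         # single segment, so all zeros
    }

# Word-piece IDs for 'nice_movie_indeed', copied from the printed tensor.
packed = pack_inputs([3835, 1035, 3185, 1035, 5262])
print(packed["input_word_ids"][:9])   # [101, 3835, 1035, 3185, 1035, 5262, 102, 0, 0]
print(sum(packed["input_mask"]))      # 7 real tokens, the remaining 121 are padding
```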

In [8]:
# Here, we create the encoder layer from encoder_url and feed it the pre-processed text; it returns a dict of outputs:

bert_model = hub.KerasLayer(encoder_url)
bert_results = bert_model(text_pre_processed)
bert_results.keys()

dict_keys(['pooled_output', 'sequence_output', 'encoder_outputs', 'default'])

In [9]:
# Here we are checking the pooled embedding. Each sentence gets one vector of size 768: one for 'nice_movie_indeed' and one for the other sentence.

bert_results['pooled_output']

<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.8119873 , -0.26844186, -0.06845339, ..., -0.13436879,
        -0.52355427,  0.8427556 ],
       [-0.91712296, -0.4793517 , -0.7865697 , ..., -0.6175173 ,
        -0.7102685 ,  0.92184293]], dtype=float32)>

In [10]:
bert_results['sequence_output']

# Because the encoding is contextualized, the padding positions also get non-zero vectors:
# nice movie indeed 0 0 0 0 0 <-- 128 positions, each with a 768-dimensional vector

<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[-0.04719081,  0.11078794,  0.01651536, ..., -0.17195085,
          0.19408041,  0.09896731],
        [ 0.3141219 , -0.4812023 ,  0.7276414 , ..., -0.04402124,
          0.4528351 , -0.23027787],
        [ 0.553434  ,  0.2630476 ,  1.1952829 , ..., -0.48282036,
         -0.5119686 , -0.296567  ],
        ...,
        [ 0.15611318, -0.03490545,  0.63516194, ..., -0.06585784,
          0.02830975,  0.07551166],
        [-0.04971653, -0.02078105,  0.64803123, ..., -0.10641124,
          0.01159183,  0.1212726 ],
        [ 0.14216903, -0.02574774,  0.63996345, ..., -0.02877554,
          0.04031472,  0.00215075]],

       [[-0.0790059 ,  0.3633513 , -0.21101557, ..., -0.1718373 ,
          0.16299753,  0.6724265 ],
        [ 0.27883515,  0.43716335, -0.3576473 , ..., -0.04463643,
          0.38315186,  0.5887984 ],
        [ 1.2037671 ,  1.0727018 ,  0.4840877 , ...,  0.24921034,
          0.40730911,  0.4048181 ],
        ...,
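
Because padding positions carry non-zero vectors, averaging `sequence_output` over all 128 positions would mix padding into the result. One common workaround, sketched here with small NumPy stand-ins instead of the real BERT tensors, is to weight each position by `input_mask`. The function name `masked_mean` and the toy shapes are assumptions for illustration.

```python
import numpy as np

def masked_mean(sequence_output, input_mask):
    """Average token vectors per sentence, ignoring padding positions."""
    mask = input_mask[..., np.newaxis].astype(sequence_output.dtype)  # (batch, seq, 1)
    summed = (sequence_output * mask).sum(axis=1)                      # (batch, hidden)
    counts = np.maximum(mask.sum(axis=1), 1.0)                         # avoid dividing by zero
    return summed / counts

# Toy stand-ins with the same layout as the BERT outputs: (batch, seq_length, hidden).
rng = np.random.default_rng(0)
sequence_output = rng.normal(size=(2, 6, 4)).astype(np.float32)
input_mask = np.array([[1, 1, 1, 0, 0, 0],
                       [1, 1, 1, 1, 1, 0]])

pooled = masked_mean(sequence_output, input_mask)
print(pooled.shape)  # (2, 4): one averaged vector per sentence
```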

In [11]:
# Here, we check the length of encoder_outputs: one entry per encoder layer (12 layers), each holding 768-dimensional vectors:
len(bert_results['encoder_outputs'])

12

In [12]:
# Here, we inspect the output of the first of the 12 encoder layers:

bert_results['encoder_outputs'][0]

<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[ 0.10655081,  0.0246375 ,  0.04212973, ...,  0.062705  ,
          0.04764155, -0.09229398],
        [ 1.0485204 ,  1.0608004 ,  1.3805104 , ...,  0.15818965,
         -0.2914644 , -0.4520943 ],
        [ 0.33819932, -0.01024941,  0.8076432 , ...,  0.19576207,
          0.346847  , -1.2290275 ],
        ...,
        [-0.03990605, -0.26347458,  0.7489512 , ...,  0.35646957,
         -0.31062606,  0.08526935],
        [-0.12449503, -0.29239565,  0.61743605, ...,  0.38330048,
         -0.22637336, -0.01101273],
        [-0.00385755, -0.18815611,  0.6656786 , ...,  0.7007718 ,
         -0.5560532 , -0.11448625]],

       [[ 0.1890359 ,  0.02752548, -0.0651374 , ..., -0.00620213,
          0.15053894,  0.03165445],
        [ 0.5916149 ,  0.7589137 , -0.07240661, ...,  0.6190394 ,
          0.8292891 ,  0.16161954],
        [ 1.4460827 ,  0.44602644,  0.4099025 , ...,  0.48255914,
          0.62691146,  0.13463417],
        ...,

In [13]:
# The last encoder output is the same as the sequence output:

bert_results['encoder_outputs'][-1] == bert_results['sequence_output']

<tf.Tensor: shape=(2, 128, 768), dtype=bool, numpy=
array([[[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]]])>
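
Scanning a large boolean tensor by eye is error-prone, so it can help to reduce the comparison to a single value. The sketch below uses NumPy stand-ins rather than the real BERT tensors (an assumption for illustration); with the actual results one option would be to convert each tensor with `.numpy()` first, or to use `tf.reduce_all` on the boolean tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for bert_results['encoder_outputs'][-1] and bert_results['sequence_output'].
last_layer = rng.normal(size=(2, 5, 3)).astype(np.float32)
sequence_output = last_layer.copy()

exactly_equal = np.array_equal(last_layer, sequence_output)  # strict element-wise equality
nearly_equal = np.allclose(last_layer, sequence_output)      # tolerant of float rounding
print(exactly_equal, nearly_equal)
```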

## Named Entity Recognition (NER) Model: NLP

#### Definition: NER extracts specific named entities from the text fed into the model. Examples of entities are person, product, company, etc.

#### NER can be used for the following:

#### 1. Search/Text based recommendation such as News articles, Google search, Customer Support, etc
#### 2. Movie recommendation such as Hotstar, Netflix, etc

#### Here, we will use spaCy's pre-trained en_core_web_sm model to walk through the code:

In [14]:
# Importing the libraries and the model:

import spacy
nlp = spacy.load("en_core_web_sm")

In [15]:
# Now let's look at the component names in the NLP pipeline:

nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [16]:
# Here, we run NER on our input text using the spaCy model:

doc = nlp("Tesla Inc is going to acquire Twitter, Inc for $45 billion")

# So, doc.ents contains all the entities in the above statement:
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Tesla Inc | ORG | Companies, agencies, institutions, etc.
Twitter, Inc | ORG | Companies, agencies, institutions, etc.
$45 billion | MONEY | Monetary values, including unit


In [17]:
# Here, let's visualize the labelled entities using spaCy's displacy module:

from spacy import displacy
displacy.render(doc, style="ent")
# displacy.render(doc, style="dep")

#### So, our finding is that the NER in spaCy, whether a stock or a custom model, is not perfect. The ner component is a statistical model, not a fixed set of rules, so it can miss or mislabel entities when the text differs from what it learned.

In [18]:
# Let's see all the entity labels that this spaCy pipeline supports:

nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

##### The entity labels are also documented on this page: https://spacy.io/models/en

In [19]:
# Here we try another example of NER using spacy:

from spacy import displacy

doc = nlp("Michael Bloomberg founded Bloomberg L.P. in 1982")

for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))      
displacy.render(doc, style="ent")

Michael Bloomberg | PERSON | People, including fictional
Bloomberg L.P. | ORG | Companies, agencies, institutions, etc.
1982 | DATE | Absolute or relative dates or periods


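#### Here the output labels Bloomberg L.P. correctly as ORG, but spaCy can still get company names wrong: in the next example it misses Tesla entirely. A BERT-based NER model from Hugging Face is another option to try: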

https://huggingface.co/dslim/bert-base-NER

In [20]:
# Let's discuss what's a Span: 

doc = nlp("Tesla is going to acquire Twitter for $45 billion")

# So, doc.ents contains all the entities in the above statement:
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Twitter | ORG | Companies, agencies, institutions, etc.
$45 billion | MONEY | Monetary values, including unit


In [21]:
print(type(doc[0]))
print(type(doc[2:5]))

<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.span.Span'>


In [22]:
# Here, we use spaCy's Span class to mark 'Tesla' (token 0) and 'Twitter' (token 5) as ORG entities by hand:

from spacy.tokens import Span

s1 = Span(doc, 0, 1, label='ORG')
s2 = Span(doc, 5, 6, label='ORG')

doc.set_ents([s1, s2], default='unmodified')

In [23]:
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_)) 

Tesla | ORG | Companies, agencies, institutions, etc.
Twitter | ORG | Companies, agencies, institutions, etc.
$45 billion | MONEY | Monetary values, including unit


In [24]:
# !pip install "tensorflow>=2.0.0"
# !pip install --upgrade tensorflow-hub
# !pip install --upgrade tensorflow_text
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [25]:
# pip list

## K-Fold Cross Validation Model: ML

#### Definition: K-Fold Cross Validation is a way to evaluate a model more reliably than with a single train/test split. When we have a large data-set which needs to be classified, for example as Spam vs Ham (Not Spam), we split the data into train and test sets to evaluate the model.

#### Let's take an example: say we have a data-set of 100 samples and split it once as 70-30 (train-test). The score then depends on which 30 samples happened to land in the test set, so it may not reflect how the model performs on new data. To get a more robust estimate, we perform K-Fold Cross Validation: we split the data into K equal folds (with K=5 each round is an 80-20 train-test split), use a different fold as the test set in each round while training on the rest, and then compare or average the K scores.
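
The rotation of the test fold can be sketched in plain Python. `k_fold_indices` is a hypothetical helper written for illustration; it produces contiguous, unshuffled folds and is not scikit-learn's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 10 samples, 5 folds: every round trains on 8 samples and tests on 2.
for train, test in k_fold_indices(n_samples=10, k=5):
    print("train:", train, "test:", test)
```

Every sample ends up in the test set exactly once, so no part of the data is left out of the evaluation.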

In [26]:
# Importing required ML libraries:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

In [27]:
# Load and return the digits dataset(classification):
digits = load_digits()

In [28]:
# We import train_test_split from model_selection to split our data into train and test sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)

In [29]:
# Logistic Regression

lr = LogisticRegression(solver='liblinear',multi_class='ovr')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9574074074074074

In [30]:
# SVM

svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.43148148148148147

In [31]:
# Random Forest

rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9574074074074074

##### Here, we see that Logistic Regression and Random Forest give comparable scores (both about 0.96), while SVM with gamma='auto' scores much lower (about 0.43). These numbers come from a single random split and change whenever the split is re-drawn, so we need a more reliable way of getting scores.

## K-Fold Cross Validation

In [38]:
# Basic example

from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


In [33]:
# For our use case we define a helper function (so we can reuse it with KFold on our digits example):

def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [60]:
# Here, we run the K-Fold evaluation loop by hand using Stratified K-Fold validation
# (this is for understanding purposes; cross_val_score does the same in one call):

from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)

scores_logistic = []
scores_svm = []
scores_rf = []

for train_index, test_index in folds.split(digits.data,digits.target):
    X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index], \
                                       digits.target[train_index], digits.target[test_index]
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X_train, X_test, y_train, y_test))  
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=60), X_train, X_test, y_train, y_test))

In [61]:
scores_logistic

[0.8948247078464107, 0.9532554257095158, 0.9098497495826378]

In [62]:
scores_svm

[0.3806343906510851, 0.41068447412353926, 0.5125208681135225]

In [63]:
scores_rf

[0.9382303839732888, 0.9515859766277128, 0.9232053422370617]

##### Here, we see that Logistic Regression and Random Forest score between about 0.89 and 0.95 on every fold, while SVM with gamma='auto' stays between about 0.38 and 0.51. Because each model is evaluated on three different test folds, comparing the per-fold (or averaged) scores is more reliable than a single train/test split.
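
The manual loop can be collapsed into one call per model. This is a minimal sketch assuming scikit-learn is installed; the `random_state=0` on the random forest and the choice of just two models are assumptions made to keep the example short and repeatable, so the exact numbers will differ from the ones above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = load_digits()

models = {
    "svm": SVC(gamma="auto"),
    "random_forest": RandomForestClassifier(n_estimators=40, random_state=0),
}

mean_scores = {}
for name, model in models.items():
    # For a classifier and an integer cv, scikit-learn uses stratified folds.
    scores = cross_val_score(model, digits.data, digits.target, cv=3)
    mean_scores[name] = scores.mean()
    print(name, scores, scores.mean())
```

Averaging the per-fold scores gives one number per model, which makes it easier to compare models or hyperparameter settings such as `n_estimators`.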

cross_val_