<a href="https://colab.research.google.com/github/C-JoGit/Zalando_Analytics_HWR_BIPM/blob/main/Exercises/Exercise4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 4. Text Representation Part 2



In this exercise we will apply the following models to the stemmed data from Exercise 2:

1.   Word2Vec
2.   Doc2vec
3.   BERT

At the end, we will derive a corpus with each of them which can be used in downstream tasks such as classification and clustering (see next exercises).


In [1]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/b0/9e/5b80becd952d5f7250eaf8fc64b957077b12ccfe73e9c03d37146ab29712/transformers-4.6.0-py3-none-any.whl (2.3MB)
[K     |▏                               | 10kB 23.0MB/s eta 0:00:01[K     |▎                               | 20kB 27.4MB/s eta 0:00:01[K     |▍                               | 30kB 31.0MB/s eta 0:00:01[K     |▋                               | 40kB 33.0MB/s eta 0:00:01[K     |▊                               | 51kB 35.3MB/s eta 0:00:01[K     |▉                               | 61kB 36.7MB/s eta 0:00:01[K     |█                               | 71kB 36.9MB/s eta 0:00:01[K     |█▏                              | 81kB 36.4MB/s eta 0:00:01[K     |█▎                              | 92kB 37.7MB/s eta 0:00:01[K     |█▍                              | 102kB 38.1MB/s eta 0:00:01[K     |█▌                              | 112kB 38.1MB/s eta 0:00:01[K     |█▊                              | 

In [2]:
# Import packages
import pickle
import pandas as pd
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import tensorflow as tf
import torch
from transformers import BertTokenizer, BertModel
from keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

## 0. Load data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
# Import dataset
data_lemma=pickle.load(open("/content/drive/MyDrive/Colab Notebooks/lemma.pkl", "rb"))
print(data_lemma[0])

car wonder enlighten car see day door sport car look late early call bricklin door small addition bumper separate rest body know tellme model engine specs year production car history info funky looking car mail thank


## 1. Word2Vec


In this section we will train the word2vec model on the lemmatized data. 


In [8]:
# Prepare the dataset for the word2vec model
corpus_gen=[doc.split() for doc in data_lemma]

# Train the model for embeddings of size 100 considering words appearing in more than 566 documents, default window=5
model = Word2Vec(corpus_gen, size=100, min_count=566)
model.save('word2vec.model')

In [9]:
print([i for i in sorted(model.wv.vocab.keys())])

['able', 'accept', 'access', 'act', 'action', 'actually', 'add', 'address', 'advance', 'ago', 'agree', 'allow', 'american', 'answer', 'anybody', 'appear', 'apple', 'application', 'apply', 'appreciate', 'apr', 'april', 'area', 'argument', 'armenian', 'armenians', 'article', 'ask', 'assume', 'atheist', 'attack', 'available', 'away', 'bad', 'base', 'begin', 'believe', 'bible', 'big', 'bike', 'bit', 'black', 'board', 'body', 'book', 'box', 'break', 'bring', 'build', 'buy', 'call', 'car', 'card', 'care', 'carry', 'case', 'cause', 'center', 'certain', 'certainly', 'change', 'check', 'child', 'chip', 'christ', 'christian', 'christians', 'church', 'city', 'claim', 'clear', 'clinton', 'clipper', 'close', 'code', 'color', 'com', 'come', 'comment', 'company', 'consider', 'contact', 'contain', 'continue', 'control', 'copy', 'correct', 'cost', 'country', 'couple', 'course', 'cover', 'create', 'crime', 'current', 'data', 'date', 'datum', 'david', 'day', 'deal', 'death', 'decide', 'design', 'device',

In [10]:
# Embedding for 'car'
vector = model.wv['car']
vector

array([ 0.6109092 ,  1.3330501 , -1.3809166 ,  0.76868606,  0.06161551,
       -0.21493743,  0.08975235,  0.6029217 ,  1.0982993 ,  0.07122575,
       -1.3275511 , -1.0173836 ,  0.45278516, -1.2391918 , -0.6974798 ,
        0.56633997,  1.5104824 ,  0.6253911 ,  0.9597482 , -1.5507442 ,
       -1.4778956 , -1.0247847 , -0.7961584 , -0.5562887 ,  0.63455874,
       -1.1249799 , -1.7093805 ,  0.2777721 ,  1.2156094 , -0.85552096,
        0.19843642,  1.7316579 , -0.30975705, -0.06821839, -1.2790812 ,
        0.3186357 , -0.5643113 , -1.4352485 ,  0.08280188,  0.14729495,
        0.21802603, -0.9665456 ,  1.1950902 ,  0.51323384,  0.6008748 ,
        0.24895026, -1.425857  ,  1.0360487 ,  0.17624019, -0.84510106,
       -0.22483894,  0.60674095, -0.3922483 ,  0.82571435,  0.804116  ,
        0.33028728,  0.13911949,  1.6067307 , -0.55362195, -0.89019895,
       -0.22220126,  0.44630054,  0.8836802 , -1.8297411 ,  0.52251315,
       -0.5769102 , -0.48168692, -1.1687733 , -1.6857878 , -1.54

In [11]:
# Most similar representations to 'car' based on cosine similarity
model.wv.most_similar('car')

[('bike', 0.6022545695304871),
 ('buy', 0.5634679794311523),
 ('get', 0.5026164054870605),
 ('speed', 0.4842330813407898),
 ('light', 0.4688034653663635),
 ('turn', 0.45954322814941406),
 ('pay', 0.4526820480823517),
 ('friend', 0.4401146173477173),
 ('guy', 0.4394468069076538),
 ('figure', 0.4387744069099426)]

In [12]:
# Embeddings' arithmetics
model.wv.most_similar(positive=['bike', 'machine'], topn=1)

[('buy', 0.6576911211013794)]

In the following we will derive the corpus. Note that word2vec (as opposed to doc2vec) generates one embedding for each word in the document. These then need to be aggregated at a document level. The simplest way is to determine the average over all words, but you can also use other aggregators.

In [13]:
# Document representation for the text
corpus_w2v=[[model.wv[word] for word in doc if word in model.wv.vocab.keys()] for doc in corpus_gen]
positive=[i for i in range(len(corpus_gen)) if len(corpus_w2v[i])>0]

corpus_w2v2=[corpus_w2v[i] for i in positive]
data_lemma2=[data_lemma[i] for i in positive]

# Document average representation
corpus_w2v_avg_clean=[sum(words)/len(words) for words in corpus_w2v2]

# This corpus can be used later in clustering and classification tasks
print(corpus_w2v_avg_clean[10])

[ 0.14873949  0.02097726 -0.03202476  0.01849939 -0.50044674 -0.661097
  0.47235054  0.23082788  0.2311119  -0.37507364 -0.1644413  -0.41786528
 -0.49764758 -0.16024014 -0.18873264  0.03298237  0.3168346   0.01432757
  0.06435312 -0.5995266  -0.03951619  0.0548044  -0.43617597  0.08818642
  0.2267098   0.03535873 -0.52702165 -0.03133269 -0.2134932  -0.30461583
 -0.0557395   0.58988094  0.02019771 -0.12913556 -0.09264978  0.24552904
  0.18652423  0.21345684 -0.12243894  0.29940176 -0.11727002  0.02341542
 -0.40388682  0.37229824  0.18009521 -0.04752867 -0.1744095   0.14383641
  0.49741048  0.30134234 -0.24714772 -0.31985456  0.2178731  -0.11193708
 -0.08371949  0.11823459  0.03852409  0.328524    0.11106371  0.06652474
 -0.42896995  0.10074733  0.00410577 -0.25970107  0.03383336  0.27288067
 -0.0936849  -0.10606506 -0.21589877 -0.46932176 -0.10034835 -0.15499108
  0.22964504 -0.06628765 -0.18045184 -0.2491117   0.333259    0.22237097
 -0.41053253 -0.19132157 -0.10518885 -0.2265482   0.1

In [14]:
len(corpus_w2v)

11314

In [15]:
len(corpus_w2v_avg_clean)

11298

In [16]:
len(data_lemma2)

11298

In [17]:
model.wv.similar_by_vector(corpus_w2v_avg_clean[0])

[('car', 0.8668628931045532),
 ('friend', 0.5587462186813354),
 ('bike', 0.5526930093765259),
 ('get', 0.5419541597366333),
 ('buy', 0.5166246294975281),
 ('month', 0.4957149028778076),
 ('lot', 0.4892970025539398),
 ('see', 0.47523200511932373),
 ('light', 0.47500720620155334),
 ('couple', 0.46947041153907776)]

In [18]:
# Most simlar words to the document based on average representation
# This can be used to evaluate different aggregation methods and also provides interpretation of the document representation
print([token for (token,_) in model.wv.similar_by_vector(corpus_w2v_avg_clean[0])])

# cosine similarity to other documents
result=[(1 - cosine(corpus_w2v_avg_clean[0],corpus_w2v_avg_clean[i])) for i in range(1,len(corpus_w2v_avg_clean))]
most_similar=data_lemma2[result.index(max(result))+1]
print(data_lemma2[0])
print('')
print(most_similar)

['car', 'friend', 'bike', 'get', 'buy', 'month', 'lot', 'see', 'light', 'couple']
car wonder enlighten car see day door sport car look late early call bricklin door small addition bumper separate rest body know tellme model engine specs year production car history info funky looking car mail thank

aussie need info car show australia car enthusiast australia particularly interested american muscle car make amc ford chrysler mopar usa weeks june chicago sun thursday denver friday sunday austin texas monday friday oklahoma city friday monday anaheim california tuesday thursday las vegas nevada friday sunday grand canion monday tuesday june las angeles san diego vicinity wednesday june sunday june june south lake tahoe cal sunday june wednesday june reno thursday june san fransisco thursday june sunday june wonder send information car show swap meets drag meet model car show period anybody tell pomona swap meet year place visit car museum private collection collection bit information appr

In [19]:
len(result)

11297

In [20]:
# Corpus as data frame that can be used in downstream tasks such as classification
corpus_w2v_avg_df=pd.DataFrame(corpus_w2v_avg_clean)
corpus_w2v_avg_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,0.431838,0.554915,-0.473366,0.348348,-0.0184,-0.339263,-0.183222,0.476204,0.261883,0.10402,-0.463734,0.03434,-0.193403,-0.199116,-0.295663,0.107863,0.625886,0.189125,0.058026,-0.390009,-0.572118,-0.283647,-0.239579,-0.016996,0.056712,-0.304037,-0.582649,0.182191,0.191718,-0.246879,-0.07459,0.549232,-0.400909,-0.293866,-0.499848,-0.058122,0.041033,-0.380195,-0.023467,0.222867,...,-0.236732,0.134452,0.381511,-0.547509,-0.032724,0.080252,-0.278591,-0.450295,-0.566273,-0.369666,0.110985,0.126148,0.01642,0.200239,0.468257,-0.259085,-0.061002,0.181229,-0.05687,-0.376331,-0.049177,-0.148495,-0.138359,0.267975,-0.497886,0.058346,-0.20958,0.109051,0.330032,0.035222,0.05581,-0.292968,-0.611761,-0.077148,0.261068,-0.032985,0.787022,0.283526,-0.294245,-0.557787
1,0.565932,0.02202,-0.1587,0.265362,-0.227214,-0.505071,-0.028102,0.27917,-0.012853,0.123492,-0.127158,-0.125798,0.123329,0.084097,0.14663,-0.105508,-0.252665,0.235758,0.031799,-0.021608,-0.254935,-0.170193,-0.183221,0.292699,0.161802,0.090474,0.054026,-0.022362,-0.04438,-0.116532,0.137857,0.19245,0.027356,0.19049,0.272204,0.031263,0.084432,0.019332,0.000909,0.129401,...,-0.016898,0.0478,-0.032475,-0.046595,0.012133,0.253037,0.30954,0.282954,-0.220799,-0.313654,-0.324893,0.060379,0.243885,0.096802,-0.216924,-0.159189,-0.124687,-0.009255,-0.084051,-0.16977,0.211311,-0.141519,0.239964,0.036097,0.102962,-0.003999,-0.006897,0.158554,0.252246,0.051677,0.031839,0.211974,-0.426352,-0.131084,0.286821,-0.200112,0.249652,0.216993,0.027926,-0.435021
2,0.200032,0.08292,0.035689,0.056912,-0.071762,-0.456534,0.092231,0.182064,0.049008,-0.102243,-0.19155,-0.078892,-0.305604,0.159843,0.04658,-0.129662,0.114227,0.106086,-0.138237,-0.181097,-0.251417,0.003998,-0.21973,0.220798,-0.069019,0.252755,-0.16971,0.06551,-0.024233,-0.287221,-0.127491,0.365946,0.077198,-0.143094,0.005304,0.004201,0.188143,0.074862,0.12363,0.516551,...,-0.142152,-0.017457,-0.051287,0.040982,-0.208166,0.35493,0.076283,0.030184,-0.049415,-0.085586,-0.01466,-0.050442,0.302029,-0.136115,-0.09877,-0.281404,-0.16682,0.207283,-0.371864,-0.086545,0.04553,-0.114873,-0.054142,0.019535,-0.15233,-0.063802,0.149671,0.004373,0.125275,0.126435,-0.066584,0.020336,-0.120725,-0.117072,0.245615,-0.107552,0.147411,0.175608,0.069617,-0.267371
3,0.160912,0.136576,0.109901,-0.071512,-0.074774,-0.10122,-0.064318,0.103438,-0.13013,0.090833,0.05414,-0.158247,0.045144,-0.039138,0.066327,0.320677,-0.320255,0.137952,0.055627,-0.021837,0.014825,0.119302,0.069419,0.040153,0.236112,0.293467,-0.264766,-0.034988,-0.044967,-0.264113,0.006071,0.283053,-0.184789,0.084338,0.148789,0.102815,0.055481,0.117734,0.115995,-0.096065,...,-0.165093,0.377994,0.082963,-0.225038,-0.056618,0.057364,-0.186196,-0.097822,0.073737,-0.222788,-0.275767,-0.170693,0.452746,-0.057721,-0.362362,-0.190018,0.264088,0.366123,-0.094306,-0.055067,0.078254,-0.053449,-0.341444,-0.006161,-0.045779,-0.081027,0.377295,0.058653,0.013858,0.29227,0.026712,0.081584,0.024326,0.256704,0.065339,0.2923,0.202322,0.197504,0.279884,-0.18995
4,0.327761,0.182905,0.212345,0.060562,-0.148076,0.141264,0.255315,0.21546,-0.159637,0.119614,-0.012543,-0.212388,-0.026973,0.093226,0.107102,0.011576,-0.561428,0.141061,-0.155791,-0.044034,-0.021383,-0.078656,-0.236567,0.091189,0.290366,0.62366,-0.282102,-0.074444,-0.025233,-0.393009,0.043411,0.276344,0.180421,-0.139247,0.189232,0.226216,-0.052958,0.275459,0.142644,-0.071291,...,-0.11787,0.181709,-0.288382,-0.0579,-0.16809,0.168625,0.062552,0.396909,0.067817,0.029313,-0.272469,-0.180449,0.23986,0.006977,-0.189842,-0.355355,-0.073495,0.002169,0.062991,0.168059,0.183793,-0.383441,-0.470033,-0.110488,0.027732,0.138039,0.344597,0.038892,0.304522,0.409449,-0.203215,0.026743,0.121462,0.050136,0.359521,0.040295,-0.080378,-0.047023,0.453389,-0.132171


In [21]:
len(corpus_w2v_avg_df)

11298

In [23]:
pickle.dump(corpus_w2v_avg_df, open("/content/drive/MyDrive/Colab Notebooks/WordtoVecModel.pkl", "wb"))

## 2. Doc2Vec

In [24]:
# Run doc2vec on the tagged texts
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus_gen)]
model2 = Doc2Vec(documents, vector_size=100, min_count=566)

In [25]:
# Embedding for the first document
vector = model2.infer_vector(corpus_gen[0])
vector

array([ 0.04789714,  0.07085122,  0.06939712,  0.05581448, -0.00750191,
       -0.10455576, -0.04523741,  0.09602484,  0.09121901, -0.01692073,
       -0.07685312,  0.02729572, -0.0545367 ,  0.01446081, -0.05093332,
        0.08788661, -0.00073461,  0.05819418,  0.01522579,  0.06609607,
       -0.03178501, -0.00318004, -0.01149177, -0.006744  ,  0.02110712,
        0.03374053, -0.04595241,  0.01703607, -0.0561133 , -0.03547852,
        0.00472683,  0.07527841, -0.07168986, -0.01108005, -0.08198757,
        0.03246092,  0.08594696,  0.00258418,  0.01722172, -0.029672  ,
       -0.02956345, -0.18091534,  0.04489931,  0.03255635, -0.00533456,
        0.03040555, -0.10571268,  0.05677893,  0.06720766,  0.01656134,
        0.08023549,  0.01606058, -0.02798738,  0.04438192, -0.03808644,
       -0.0331932 , -0.01948151,  0.00779004, -0.05082292, -0.10607371,
       -0.07750382,  0.0319143 ,  0.03494918, -0.07328069,  0.02113496,
        0.0402196 , -0.04457681, -0.03642835,  0.05391117, -0.05

In [26]:

# cosine similarity to other documents
result=[(1 - cosine(vector,model2.infer_vector(corpus_gen[i]))) for i in range(1,len(corpus_gen))]
most_similar=data_lemma[result.index(max(result))+1]

print(most_similar)

mitsubishi hard drive help mitsubishi hard drive help hard disk new mitsubishi hard drive rll mfm storage format suspect switch setting move movement drive place switch setting drive switch switch select drive number info drive know number configure let know email cyl head think type thank advance chuck brown charles brown brown galois nscf org university brown moe coe uga edu augusta georgia cbrowni eis calstate edu


In [27]:
len(corpus_gen)

11314

In [28]:
# Final corpus for classification
corpus_d2v=pd.DataFrame([model2.infer_vector(doc) for doc in corpus_gen])

In [29]:
corpus_d2v.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,0.033106,0.098918,0.050923,0.045089,-0.01651,-0.083881,-0.041904,0.10257,0.084924,-0.019586,-0.05943,0.016196,-0.042996,-0.020928,-0.047943,0.086731,0.016809,0.059472,0.011163,0.031227,-0.02599,-0.030775,-0.013724,-0.004509,0.048144,0.027729,-0.058544,0.008638,-0.023791,-0.076936,-0.00491,0.090226,-0.069956,-0.016542,-0.077008,0.015614,0.040372,-0.013879,0.04178,-0.025325,...,-0.079405,0.034498,0.02061,-0.097783,0.00342,0.011522,-0.051112,-0.057745,0.030581,-0.049891,0.010007,0.04647,-0.015335,0.065181,-0.06226,-0.021381,-0.024897,0.058858,-0.043589,-0.083691,-0.017372,0.044659,-0.112701,-0.030502,-0.025539,0.064363,0.005199,0.063228,0.074967,0.040241,0.037232,0.067679,-0.034197,0.018707,0.096093,0.004697,0.055006,-0.011854,0.057327,-0.051704
1,0.087314,-0.006708,0.070973,-0.000895,-0.020421,-0.033985,-0.025585,0.037588,-0.011117,0.04684,-0.000195,-0.005179,0.021102,0.026539,-0.007511,-0.024511,-0.044508,0.009384,-0.017811,-0.020195,-0.035385,-0.076433,0.003944,0.04292,-0.014621,0.034137,0.034253,-0.009089,-0.022839,0.001922,-0.018337,-0.03138,0.021241,0.01952,0.00597,0.024516,0.096515,-0.001737,-0.063147,-0.039192,...,0.023904,0.017436,-0.040995,0.030023,0.03679,0.096343,0.101129,0.071277,0.014044,-0.089072,0.003651,-0.000497,0.025844,0.090702,-0.05528,-0.039763,-0.093893,0.003165,-0.066528,-0.009757,0.056503,-0.02554,0.063258,-0.053109,0.009131,0.00938,-0.088039,0.013424,0.05565,0.041001,0.021923,0.05434,-0.045752,-0.066931,0.036386,-0.060174,0.009829,0.025389,0.019687,-0.123803
2,-0.020082,0.076726,0.000217,-0.095554,0.054477,-0.048383,-0.001178,0.069049,0.104295,0.031209,-0.012774,0.028132,-0.190988,0.124783,0.084488,-0.035021,-0.004754,0.133061,-0.066553,-0.15008,-0.015618,0.068147,-0.024178,0.008021,-0.161173,0.062074,0.003902,0.023934,0.005674,-0.063869,-0.021318,0.042629,0.009544,-0.009217,0.016111,0.020191,0.041092,0.030274,-0.002066,0.144035,...,0.044755,-0.06101,0.048108,0.04538,-0.03504,0.238326,0.032068,0.082385,0.055611,-0.088862,0.081477,-0.064901,0.114603,-0.034756,0.050653,-0.014394,-0.071905,-0.007106,-0.117127,0.088055,0.075215,0.000832,0.041479,0.01491,0.021439,-0.051645,0.015267,0.004228,0.042628,0.021603,0.021316,0.080637,-0.065208,0.017536,0.069011,0.019527,-0.069017,0.079466,-0.083851,-0.029499
3,0.019127,0.00365,-0.003839,-0.007965,-0.035752,-0.015379,-0.002444,0.002232,-0.022033,0.022873,-0.049688,0.00281,0.021491,0.027247,0.010128,0.043042,-0.02325,-0.003554,0.049331,0.014763,0.048,0.008772,0.021482,-0.039993,-0.018215,0.012272,0.013622,0.00784,-0.006262,-0.016387,0.021047,0.003252,-0.023872,0.054374,0.018616,-0.008516,0.0165,0.01695,0.014876,0.001183,...,0.023895,0.054719,-0.006145,-0.011333,0.031115,-0.014977,-0.027866,0.009456,0.026328,0.010133,-0.036406,-0.019964,0.047808,-0.03197,-0.052114,-0.01323,0.05173,0.032968,0.009265,0.008906,-0.00772,0.047693,-0.029446,0.033402,-0.013943,0.000672,0.062104,-0.025988,-0.041461,0.031528,-0.030165,0.026802,-0.015122,0.072199,-0.053174,0.022593,0.036794,0.049933,-0.006916,0.01319
4,0.006459,0.025785,0.148373,0.046946,0.006658,-0.0202,0.017593,0.07086,0.058141,0.030598,-0.129147,0.01285,-0.015445,0.07905,-0.002163,0.033394,-0.124316,0.094877,0.072353,0.003525,0.005859,0.001051,-0.001674,-0.026987,0.024704,0.127752,-0.018889,-0.0029,-0.034714,-0.002294,0.03356,0.045775,0.009591,0.020055,-0.025792,0.001648,0.016852,0.068201,0.064958,0.003224,...,0.003139,0.039928,0.021083,-0.049624,0.027466,0.033412,0.056385,0.041078,0.073926,0.055372,0.053146,-0.05143,0.014331,0.043536,-0.048152,-0.096933,-0.08821,-0.000173,-0.057735,0.052815,-0.012372,0.092582,-0.08383,0.045299,0.016906,0.094267,0.002661,0.01515,0.097696,0.021847,-0.011965,0.040954,-0.032073,-0.018053,0.085566,0.026778,-0.019486,0.014699,0.056411,-0.076907


In [30]:
pickle.dump(corpus_d2v, open("/content/drive/MyDrive/Colab Notebooks/DoctoVecModel.pkl", "wb"))

## 3. BERT


Confirm that GPU is detected:

In [31]:
# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


Assign the GPU device to torch:

In [32]:
# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


In order to apply BERT, we need to derive three data objects for the text data:
1. Add [CLS] at the beginning and [SEP] at the end of each text. [SEP] is a legacy from teh model training. The result for [CLS] is then used later as document representation for classification tasks.
2. Tokenize the texts using BERT tokenizer
3. Pad or truncate the text to the maximum length (maximum 512)
4. Map the remaining tokens to BERT dictionary 





In [33]:
# 1. Add [CLS] at the beginning and [SEP] at the end of each text.
sentences = ["[CLS] " + query + " [SEP]" for query in data_lemma]
print(sentences[0])

[CLS] car wonder enlighten car see day door sport car look late early call bricklin door small addition bumper separate rest body know tellme model engine specs year production car history info funky looking car mail thank [SEP]


In [34]:
# 2. Tokenize the texts using BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print(tokenized_texts[0])

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=28.0, style=ProgressStyle(description_w…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466062.0, style=ProgressStyle(descripti…


['[CLS]', 'car', 'wonder', 'en', '##light', '##en', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'brick', '##lin', 'door', 'small', 'addition', 'bumper', 'separate', 'rest', 'body', 'know', 'tell', '##me', 'model', 'engine', 'spec', '##s', 'year', 'production', 'car', 'history', 'info', 'funky', 'looking', 'car', 'mail', 'thank', '[SEP]']


In [35]:
# Show token IDs based on BERT's training
print(tokenizer.convert_tokens_to_ids(tokenized_texts[0]))

[101, 2482, 4687, 4372, 7138, 2368, 2482, 2156, 2154, 2341, 4368, 2482, 2298, 2397, 2220, 2655, 5318, 4115, 2341, 2235, 2804, 21519, 3584, 2717, 2303, 2113, 2425, 4168, 2944, 3194, 28699, 2015, 2095, 2537, 2482, 2381, 18558, 24151, 2559, 2482, 5653, 4067, 102]


In order to determine the maximum sequence length, we look at the list statistics.

In [36]:
leng=[]
for t in tokenized_texts:
  leng.append(len(t))
df=pd.DataFrame(leng)
df.describe()

Unnamed: 0,0
count,11314.0
mean,170.791232
std,394.855169
min,4.0
25%,63.0
50%,103.0
75%,166.0
max,8235.0


In [37]:
df.quantile([.95, .99])

Unnamed: 0,0
0.95,412.0
0.99,1303.96


In [39]:
# 3. Pad the text to the maximum length, max 512

# Pad sequences that are less than MAX_LEN, if more, remove from the end
sentences_padded = pad_sequences(tokenized_texts,  dtype=object,maxlen=412,  value='[PAD]', truncating="post",padding="post")
print(sentences_padded[0])

['[CLS]' 'car' 'wonder' 'en' '##light' '##en' 'car' 'see' 'day' 'door'
 'sport' 'car' 'look' 'late' 'early' 'call' 'brick' '##lin' 'door' 'small'
 'addition' 'bumper' 'separate' 'rest' 'body' 'know' 'tell' '##me' 'model'
 'engine' 'spec' '##s' 'year' 'production' 'car' 'history' 'info' 'funky'
 'looking' 'car' 'mail' 'thank' '[SEP]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' 

In [40]:
#4. Map the tokens to BERT dictionary 
# Convert the tokens to their index numbers in the BERT vocabulary
sentences_converted = [tokenizer.convert_tokens_to_ids(s) for s in sentences_padded]
print(sentences_converted[0])

[101, 2482, 4687, 4372, 7138, 2368, 2482, 2156, 2154, 2341, 4368, 2482, 2298, 2397, 2220, 2655, 5318, 4115, 2341, 2235, 2804, 21519, 3584, 2717, 2303, 2113, 2425, 4168, 2944, 3194, 28699, 2015, 2095, 2537, 2482, 2381, 18558, 24151, 2559, 2482, 5653, 4067, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [41]:
# Create attention masks
masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in sentences_converted:
  seq_mask = [float(i>0) for i in seq]
  masks.append(seq_mask)

In [42]:
# 5. Generate embeddings

#Convert all of our data into torch tensors, the required datatype for our model

inputs = torch.LongTensor(sentences_converted)
masks = torch.LongTensor(masks)

In [43]:
inputs.size()

torch.Size([11314, 412])

In [44]:
masks.size()

torch.Size([11314, 412])

In [45]:
# Apply Pretrained model to the sentences
model = BertModel.from_pretrained('bert-base-uncased')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=570.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=440473133.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [46]:
model.to(device)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [47]:
# Set the batch size.  
batch_size = 16  

# Create the DataLoader.
prediction_data = TensorDataset(inputs, masks)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

In [48]:
result=[]
i=0
for batch in prediction_dataloader:
  #print(i)
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)


  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask = batch

  # Telling the model not to compute or store gradients, saving memory and 
  
  with torch.no_grad():
      # Forward pass, calculate embeddings
      outputs = model(b_input_ids)

  embeddings = outputs.pooler_output #CLS embeddings for the batch

  # Move em to CPU
  embeddings = embeddings.detach().cpu().numpy()
  
  # Store predictions and true labels
  result.append(embeddings)
  i=i+1


print('    DONE.')

    DONE.


In [50]:
#708 batches*16 texts with embedding size 768

In [51]:
final=[]
for b in result:
   for e in b:
      final.append(e)

In [52]:
# Final corpus
corpus_bert_df=pd.DataFrame(final)
corpus_bert_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767
0,-0.120891,-0.280994,-0.963245,0.419622,0.778366,-0.196418,-0.313293,0.23287,-0.862499,-0.955871,0.223536,0.886177,0.327486,0.866635,-0.249378,0.088374,0.439759,0.016313,0.064368,0.730549,0.462973,0.999983,-0.40831,0.307045,0.337647,0.908904,-0.068019,-0.055471,0.25283,0.328617,0.393255,0.119468,-0.622634,-0.270945,-0.978561,-0.187582,0.180457,0.12795,-0.101807,-0.25029,...,0.42801,-0.24705,-0.033118,-0.254504,-0.322066,-0.023203,-0.252481,-0.285563,0.193448,0.033795,0.999958,-0.736765,-0.848266,-0.186036,-0.336793,0.301686,-0.462011,-0.999997,0.255416,-0.859095,0.80337,-0.318131,0.902804,-0.754084,0.307099,-0.061251,0.62403,0.844899,-0.087035,-0.475938,0.161414,-0.928535,0.863065,-0.074633,-0.00226,-0.675741,0.185203,-0.855487,-0.17854,-0.240451
1,-0.183047,-0.320446,-0.967441,0.510457,0.806681,-0.212261,-0.239628,0.210489,-0.880438,-0.945462,0.122012,0.914262,0.214913,0.882294,-0.172117,0.000281,0.382229,0.015465,0.056818,0.700129,0.484552,0.999983,-0.460843,0.301718,0.323201,0.921945,-0.133789,-0.070382,0.238025,0.354671,0.378942,0.101619,-0.541697,-0.292166,-0.978894,-0.117593,0.185518,0.117982,-0.145663,-0.236753,...,0.481052,-0.270803,-0.04876,-0.217497,-0.346388,0.035731,-0.253299,-0.326463,0.175167,-0.009337,0.999946,-0.775669,-0.875688,-0.17833,-0.360491,0.283788,-0.45156,-0.999997,0.240596,-0.880769,0.822632,-0.391792,0.911289,-0.800638,0.229179,-0.054229,0.605605,0.866382,-0.10912,-0.466791,0.17515,-0.939977,0.888713,-0.089563,-0.091914,-0.731656,0.169998,-0.887561,-0.124455,-0.239807
2,-0.448309,-0.47241,-0.966038,0.469346,0.783993,-0.309353,-0.07145,0.402272,-0.900948,-0.996757,-0.014901,0.911494,0.588971,0.86902,0.128399,-0.312896,0.298809,-0.298606,0.229429,0.730347,0.562585,0.999993,-0.376672,0.494585,0.460912,0.939935,-0.356034,0.246502,0.618172,0.554519,0.120343,0.322657,-0.796389,-0.403809,-0.979325,-0.658267,0.306179,-0.15493,-0.216211,-0.144139,...,0.30866,-0.323581,-0.213666,-0.265178,-0.020348,-0.309221,-0.446614,-0.392979,0.400869,0.209144,0.99998,-0.796421,-0.905167,-0.253032,-0.453571,0.496375,-0.544815,-1.0,0.28958,-0.917304,0.874065,-0.410036,0.887686,-0.891297,-0.080705,-0.25599,0.608703,0.891756,-0.33259,-0.503152,0.638434,-0.900726,0.892726,0.026648,-0.237032,-0.610219,0.718728,-0.91158,-0.425955,-0.05975
3,-0.098515,-0.385608,-0.987904,0.494115,0.869421,-0.238608,-0.245373,0.274326,-0.947566,-0.950335,-0.018988,0.949433,-0.0147,0.939705,-0.249542,-0.130132,0.267692,-0.008533,0.047951,0.734438,0.511568,0.999997,-0.642022,0.352938,0.385997,0.963973,-0.198282,-0.205098,0.149731,0.367538,0.343082,0.175255,-0.408339,-0.325079,-0.989443,-0.010332,0.279057,0.168347,-0.1956,-0.278317,...,0.652145,-0.299753,-0.139638,-0.173683,-0.443671,-0.027985,-0.349703,-0.388918,0.165237,0.035473,0.999991,-0.882225,-0.953739,-0.24563,-0.413192,0.379641,-0.529157,-1.0,0.287768,-0.922833,0.899929,-0.592388,0.953749,-0.889039,0.282065,-0.129607,0.664973,0.92745,-0.143531,-0.5332,0.266515,-0.961368,0.948298,-0.195683,-0.299023,-0.849391,0.298872,-0.932983,-0.178504,-0.309659
4,-0.180124,-0.413501,-0.984492,0.474194,0.865759,-0.281325,-0.225921,0.305099,-0.935964,-0.974516,0.008402,0.932913,0.18234,0.928759,-0.195707,-0.15407,0.334482,-0.087378,0.10497,0.755661,0.508255,0.999997,-0.580859,0.43008,0.409025,0.948664,-0.238847,-0.117083,0.270822,0.436732,0.326114,0.217338,-0.518259,-0.354096,-0.987442,-0.17187,0.272495,0.096164,-0.216041,-0.242049,...,0.606793,-0.327609,-0.150611,-0.184259,-0.386002,-0.137363,-0.39325,-0.38306,0.206357,0.095422,0.999989,-0.869633,-0.944107,-0.264682,-0.423192,0.436482,-0.529781,-1.0,0.28089,-0.92591,0.897921,-0.559192,0.944515,-0.896119,0.235804,-0.174416,0.641793,0.917698,-0.214031,-0.523474,0.428578,-0.948966,0.931038,-0.180842,-0.408895,-0.821256,0.473733,-0.940421,-0.238981,-0.261727


In [54]:
pickle.dump(corpus_bert_df, open("/content/drive/MyDrive/Colab Notebooks/BertModel.pkl", "wb"))