# What is Word Embedding?
* **Representing text as numbers:** Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing we must do is come up with a strategy to convert strings to numbers (that is, to "vectorize" the text) before feeding it to the model.

In general, there are 3 strategies for doing so:
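* **One-hot encodings**: each word is represented by a vector as long as the vocabulary, with a 1 at the word's index and 0 everywhere else.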

![alt text](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/one-hot.png?raw=1)

* **Encode each word with a unique number**

the -> 0
cat -> 1
mat -> 2
on  -> 3

* **Word embeddings**: a dense vector representation in which each word maps to a short vector of floating-point values. These values are trainable parameters, learned along with the rest of the model.

![alt text](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/embedding2.png?raw=1)
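
The three strategies above can be compared on a toy vocabulary. This is a minimal sketch in plain Python; the embedding size of 4 and the random initial values are illustrative assumptions, and in a real model the embedding values would be learned during training.

```python
import random

# Toy vocabulary taken from the integer-encoding example above
vocab = ["the", "cat", "mat", "on"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse vector: all zeros except a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec

# Embedding table: one dense row of floats per word.
# The size 4 and the random values are assumptions for illustration only.
random.seed(0)
embedding_dim = 4
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in vocab]

def embed(word):
    """Dense vector: look up the row for the word's integer id."""
    return embedding_table[word_to_id[word]]

print(one_hot("cat"))      # [0, 1, 0, 0]
print(word_to_id["cat"])   # 1
print(len(embed("cat")))   # 4
```

The one-hot vector grows with the vocabulary, while the embedding keeps a fixed, small length regardless of how many words there are.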


# References:

* [TF word embedding tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings)
* [Word Embedding Example](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)
* [Tokenization](https://www.kdnuggets.com/2020/03/tensorflow-keras-tokenization-text-data-prep.html)



In [1]:
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!pip install transformers requests beautifulsoup4 pandas numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re

In [4]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')


In [5]:
tokens = tokenizer.encode('It was good but couldve been better. Great', return_tensors='pt')

In [6]:
result = model(tokens)

In [7]:
int(torch.argmax(result.logits))+1

4
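
The `+ 1` converts the zero-based class index into a star rating, since this model's five output classes correspond to 1 to 5 stars. A minimal sketch of that step in plain Python; the logit values are hypothetical, because the actual contents of `result.logits` are not shown in the notebook.

```python
# Hypothetical logits for the five classes (1 to 5 stars); the real values
# come from result.logits and are not shown in the notebook.
logits = [-1.2, -0.4, 0.9, 2.1, 1.3]

# argmax returns a zero-based class index
best_index = max(range(len(logits)), key=lambda i: logits[i])

# Class 0 means 1 star, so add 1 to get the star rating
stars = best_index + 1
print(stars)  # 4
```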

In [8]:
from numpy import array
import tensorflow as tf
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SpatialDropout1D, Dropout, Convolution1D
from tensorflow.keras.layers import Flatten,  LSTM, GlobalMaxPooling1D

from tensorflow.keras.layers import Embedding
import numpy as np
import pandas as pd
import urllib

---
# Corpus

In [9]:
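import urllib.request  # `import urllib` alone does not guarantee the request submodule is loaded

def loadFile(url):
  """Download a UTF-8 text file and return its lines as a list of statements."""
  stms = []
  with urllib.request.urlopen(url) as file:
    for line in file:
      line = line.decode("utf-8")
      # Skip lines that are empty or too short to be a statement
      if len(line) > 2:
        stms.append(line.strip())
  return stms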


In [None]:
urlWish = ''
urlCurse= ''


In [10]:


wish = "/content/olumlu.txt"
curse ="/content/olumsuz.txt"

totalWish=len(wish)
print('totalWish: ',totalWish)
totalCurse = len(curse)
print('totalCurse: ',totalCurse)

totalWish:  19
totalCurse:  20
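
Note that `wish` and `curse` are path strings here, so `len()` returns the number of characters in each path (19 and 20), not the number of statements. A minimal sketch of reading such a file into a list of statements, assuming one statement per line. `load_statements` is a hypothetical helper, and a temporary file holding two phrases from the free-text list later in this notebook stands in for `/content/olumlu.txt`, whose contents are not shown.

```python
import os
import tempfile

def load_statements(path):
    """Return the non-empty lines of a UTF-8 text file as a list of strings."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Stand-in file for illustration; the real corpus files are not shown
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("hayırlı olsun\n\nmutlu ol inşallah\n")
    path = tmp.name

statements = load_statements(path)
os.remove(path)

print(len(statements))  # 2
print(statements)       # ['hayırlı olsun', 'mutlu ol inşallah']
```

With lists of statements, the slicing and concatenation in the following cells would operate on whole statements rather than on characters.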


In [11]:
curse= curse[:totalWish]
totalCurse = len(curse)
print('totalCurse: ',totalCurse)

totalCurse:  19


In [12]:
testWish= int(totalWish* 0.1)
testCurse = int(totalCurse * 0.1)
print('testWish ', testWish)
print('testCurse ', testCurse)

trainDocs= wish[:-testWish]+curse[:-testCurse]
testDocs= wish[-testWish:]+curse[-testCurse:]
print(len(trainDocs))
print(len(testDocs))

trainLabels = np.concatenate((np.ones(totalWish-testWish),np.zeros(totalCurse-testCurse)), axis=0)
testLabels = np.concatenate((np.ones(testWish),np.zeros(testCurse)), axis=0)

print(len(trainLabels))
print(len(testLabels))

testWish  1
testCurse  1
36
2
36
2


In [13]:
allDocs= trainDocs + testDocs
print(allDocs)
print(len(allDocs))

/content/olumlu.tx/content/olumsuz.ttx
38


---
# Tokenize the corpus

In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer
# Fit the tokenizer on the full corpus (training and test documents)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(allDocs)

document_count = tokenizer.document_count
vocab_size = len(tokenizer.word_index)

# Encode every document as a sequence of word ids
allDocs_sequences = tokenizer.texts_to_sequences(allDocs)

# Longest sequence in the corpus; used as the padding length
max_length = max(len(x) for x in allDocs_sequences)

# Word-to-id mapping learned from the corpus
word_index = tokenizer.word_index
print("Corpus Summary")
print("Word index:", word_index)
print("document count  :", document_count)
print("vocabulary size :", vocab_size)
print("Maximum length of the statements :", max_length)

Corpus Summary
Word index: {'t': 1, 'o': 2, 'n': 3, 'u': 4, 'l': 5, 'c': 6, 'e': 7, 'm': 8, 'x': 9, 's': 10, 'z': 11}
document count  : 38
vocabulary size : 11
Maximum length of the statements : 1
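
The word index above contains single characters because the corpus passed to the tokenizer was one string rather than a list of statements, so each character was treated as a document. For a list of statements, a word-level index can be built roughly as follows. This is a simplified sketch in plain Python, not the Keras implementation: it lowercases, splits on whitespace, and assigns ids starting at 1 in order of decreasing frequency. The sample statements are taken from the free-text list later in this notebook, and the `demo_` names are hypothetical.

```python
from collections import Counter

demo_docs = ["gözlerin dert görmesin", "dert görmesin", "hayırlı olsun"]

def build_word_index(texts):
    """Map each word to an integer id, most frequent word first, ids from 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; ties keep first-seen order (the sort is stable)
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ordered, start=1)}

def encode_texts(texts, word_index):
    """Replace each known word with its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

demo_word_index = build_word_index(demo_docs)
print(demo_word_index)
# {'dert': 1, 'görmesin': 2, 'gözlerin': 3, 'hayırlı': 4, 'olsun': 5}
print(encode_texts(demo_docs, demo_word_index))
# [[3, 1, 2], [1, 2], [4, 5]]
```

Id 0 is left unused so that it can serve as the padding value.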


In [15]:
# Encode training data sentences into sequences
train_sequences = tokenizer.texts_to_sequences(trainDocs)

# Pad the training sequences
train_padded = pad_sequences(train_sequences, padding='post', truncating='post', maxlen=max_length)

# Output the results of our work
print("Train Doc Summary")
print("\nTraining sequences:\n", train_sequences)
print("\nPadded training sequences:\n", train_padded[:5])
print("\nPadded training shape:", train_padded.shape)
print("Training sequences data type:", type(train_sequences))
print("Padded Training sequences data type:", type(train_padded))

Train Doc Summary

Training sequences:
 [[], [6], [2], [3], [1], [7], [3], [1], [], [2], [5], [4], [8], [5], [4], [], [1], [9], [], [6], [2], [3], [1], [7], [3], [1], [], [2], [5], [4], [8], [10], [4], [11], [], [1]]

Padded training sequences:
 [[0]
 [6]
 [2]
 [3]
 [1]]

Padded training shape: (36, 1)
Training sequences data type: <class 'list'>
Padded Training sequences data type: <class 'numpy.ndarray'>
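
With `padding='post'` and `truncating='post'`, zeros are appended to short sequences and long sequences are cut at the end. A rough plain-Python sketch of that behaviour; `pad_post` is a hypothetical helper, not the Keras function.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value` and truncate at the end to `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                 # truncating='post'
        seq += [value] * (maxlen - len(seq))     # padding='post'
        padded.append(seq)
    return padded

print(pad_post([[], [6], [3, 1, 2, 5]], maxlen=3))
# [[0, 0, 0], [6, 0, 0], [3, 1, 2]]
```

With `maxlen=1`, as in this notebook, an empty sequence becomes `[0]` and `[6]` stays `[6]`, which matches the padded output above.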


In [16]:
# Encode test data sentences into sequences
test_sequences = tokenizer.texts_to_sequences(testDocs)

# Pad the test sequences
test_padded = pad_sequences(test_sequences, padding='post', truncating='post', maxlen=max_length)

# Output the results of our work
print("Test Doc Summary")
print("\nTest sequences:\n", test_sequences)
print("\nPadded test sequences:\n", test_padded[:5])
print("\nPadded test shape:", test_padded.shape)
print("Test sequences data type:", type(test_sequences))
print("Padded Test sequences data type:", type(test_padded))

Test Doc Summary

Test sequences:
 [[1], [9]]

Padded test sequences:
 [[1]
 [9]]

Padded test shape: (2, 1)
Test sequences data type: <class 'list'>
Padded Test sequences data type: <class 'numpy.ndarray'>


---
# Model 1: Vanilla Deep NN

In [17]:
#@title ENTER EPOCH
epochs =  100#@param {type:"integer"}


In [18]:
# define the model
model1 = Sequential()
model1.add(Dense(8, input_shape=(max_length,)))
#model.add(Flatten())
model1.add(Dense(64, activation='relu'))
model1.add(Dense(128, activation='relu'))
model1.add(Dense(64, activation='relu'))
model1.add(Dense(32, activation='relu'))
model1.add(Dense(1, activation='sigmoid'))
# compile the model
model1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model1.summary())
# fit the model
model1.fit(train_padded, trainLabels, epochs=epochs, verbose=0)
# evaluate the model
loss, accuracy = model1.evaluate(test_padded, testLabels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 8)                 16        
                                                                 
 dense_1 (Dense)             (None, 64)                576       
                                                                 
 dense_2 (Dense)             (None, 128)               8320      
                                                                 
 dense_3 (Dense)             (None, 64)                8256      
                                                                 
 dense_4 (Dense)             (None, 32)                2080      
                                                                 
 dense_5 (Dense)             (None, 1)                 33        
                                                                 
Total params: 19,281
Trainable params: 19,281
Non-traina

---
# Model 2: Deep NN with Word Embedding





tf.keras.layers.Embedding(
    **input_dim,** **output_dim,** embeddings_initializer='uniform',
    embeddings_regularizer=None, activity_regularizer=None,
    embeddings_constraint=None, mask_zero=False, **input_length=**None, **kwargs
)
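
Here `input_dim` is the number of rows in the lookup table (the vocabulary size plus one for the padding id 0), `output_dim` is the length of each vector, and `input_length` is the number of token ids per sample. Conceptually the layer is a table lookup. The sketch below shows the resulting shapes in plain Python; the random values only stand in for trained weights and the helper name is hypothetical.

```python
import random

random.seed(0)
input_dim, output_dim, input_length = 12, 8, 1   # values used in this notebook

# One row per token id; uniform random values stand in for trained weights
table = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
         for _ in range(input_dim)]

def embedding_lookup(batch):
    """Map a batch of id sequences to a batch of vector sequences."""
    return [[table[token_id] for token_id in seq] for seq in batch]

batch = [[0], [6], [2]]            # three padded sequences of length 1
out = embedding_lookup(batch)

# Shape is (batch, input_length, output_dim)
print(len(out), len(out[0]), len(out[0][0]))  # 3 1 8
```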

In [19]:
input_dim = vocab_size+1
output_dim = 8

# define the model
model2 = Sequential()
model2.add(Embedding(input_dim, output_dim, input_length=max_length, name= 'embeded'))
model2.add(Flatten())
model2.add(Dense(32, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))

# compile the model
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model
print(model2.summary())

# fit the model
model2.fit(train_padded, trainLabels, epochs=epochs, verbose=0)

# evaluate the model
loss, accuracy = model2.evaluate(test_padded, testLabels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embeded (Embedding)         (None, 1, 8)              96        
                                                                 
 flatten (Flatten)           (None, 8)                 0         
                                                                 
 dense_6 (Dense)             (None, 32)                288       
                                                                 
 dense_7 (Dense)             (None, 1)                 33        
                                                                 
Total params: 417
Trainable params: 417
Non-trainable params: 0
_________________________________________________________________
None
Accuracy: 0.000000
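
The parameter counts in the summary can be checked by hand: an embedding layer has `input_dim * output_dim` weights, and a dense layer has `inputs * units` weights plus `units` biases. A small sketch of that arithmetic (the function names are hypothetical):

```python
def embedding_params(input_dim, output_dim):
    """One vector of length output_dim per token id."""
    return input_dim * output_dim

def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units

counts = [
    embedding_params(12, 8),   # embeded: 96
    dense_params(8, 32),       # dense_6: 288 (Flatten outputs 1 * 8 = 8 values)
    dense_params(32, 1),       # dense_7: 33
]
print(counts, sum(counts))     # [96, 288, 33] 417
```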


---
# Model 3: Word Embedding + LSTM

In [20]:
input_dim = vocab_size+1
output_dim = 8

# define the model
model3 = Sequential()
model3.add(Embedding(input_dim, output_dim, input_length=max_length, name= 'embeded'))
model3.add(SpatialDropout1D(0.25))
model3.add(LSTM(16, return_sequences=True))
model3.add(LSTM(8))
model3.add(Dropout(0.25))
model3.add(Dense(1, activation='sigmoid'))

# compile the model
model3.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model
print(model3.summary())

# fit the model
model3.fit(train_padded, trainLabels, epochs=epochs, verbose=0)

# evaluate the model
loss, accuracy = model3.evaluate(test_padded, testLabels, verbose=0)
print('Accuracy: %f' % (accuracy*100))


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embeded (Embedding)         (None, 1, 8)              96        
                                                                 
 spatial_dropout1d (SpatialD  (None, 1, 8)             0         
 ropout1D)                                                       
                                                                 
 lstm (LSTM)                 (None, 1, 16)             1600      
                                                                 
 lstm_1 (LSTM)               (None, 8)                 800       
                                                                 
 dropout (Dropout)           (None, 8)                 0         
                                                                 
 dense_8 (Dense)             (None, 1)                 9         
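
The LSTM counts in the summary follow from the four gates, each with an input kernel, a recurrent kernel, and a bias: `4 * ((inputs + units) * units + units)`. A quick check against the figures above (`lstm_params` is a hypothetical helper):

```python
def lstm_params(inputs, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((inputs + units) * units + units)

print(lstm_params(8, 16))   # 1600, matches lstm
print(lstm_params(16, 8))   # 800, matches lstm_1
```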
                                                      

---
# Model 4: Word Embedding + Convolution1D

In [15]:
input_dim = vocab_size+1
output_dim = 8

# define the model
model4 = Sequential()
model4.add(Embedding(input_dim, output_dim, input_length=max_length, name= 'embeded'))
model4.add(Dropout(0.50))
model4.add(Convolution1D(16,3))
model4.add(Convolution1D(16,5))
model4.add(GlobalMaxPooling1D())
model4.add(Dropout(0.50))
model4.add(Dense(16, activation='relu'))
model4.add(Dense(1, activation='sigmoid'))

# compile the model
model4.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model
print(model4.summary())

# fit the model
model4.fit(train_padded, trainLabels, epochs=epochs, verbose=0)

# evaluate the model
loss, accuracy = model4.evaluate(test_padded, testLabels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

ValueError: ignored
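
The traceback is not shown, so the cause is an assumption, but one likely reason is the sequence length. With the default `padding='valid'`, a 1D convolution with kernel size `k` over a sequence of length `L` produces `L - k + 1` steps, and here `max_length` is 1 while the first kernel size is 3. A sketch of that arithmetic (`conv1d_output_length` is a hypothetical helper):

```python
def conv1d_output_length(length, kernel_size, padding="valid", stride=1):
    """Output steps of a 1D convolution for 'valid' or 'same' padding."""
    if padding == "same":
        return -(-length // stride)          # ceil(length / stride)
    return (length - kernel_size) // stride + 1

print(conv1d_output_length(1, 3))                   # -1: no valid output
print(conv1d_output_length(1, 3, padding="same"))   # 1
print(conv1d_output_length(20, 3))                  # 18
```

A longer `max_length` or `padding='same'` would keep the output length positive.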

---
# More models: Embedding + Conv1D+ LSTM + Attention



*Do it yourself :)*

---
# Some free text



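* 'gözlerin dert görmesin' (may your eyes never see sorrow)
* 'gözlerin görmesin' (may your eyes not see)
* 'gün görmesin' (may they never see a good day)
* 'yüzün gün görmesin' (may your face never see a good day)
* 'ellerin dert görmesin' (may your hands never see trouble)
* 'dert görmesin' (may they never see trouble)
* 'gözlerin kör olsun' (may your eyes go blind)
* 'hayırlı olsun' (may it bring good)
* 'hayır olmasın' (may no good come of it)
* 'belanı göresin' (may misfortune find you)
* 'belanı görmeyesin' (may misfortune not find you)
* 'kör ol inşallah' (may you go blind, God willing)
* 'mutlu ol inşallah' (may you be happy, God willing)
* 'cennetlik ol inşallah' (may you be bound for heaven, God willing)
* 'toprak ol inşallah' (may you turn to dust, God willing)
* 'kısmetin bol olsun inşallah' (may your fortune be plentiful, God willing)
* 'kısmetin yok olsun inşallah' (may your fortune vanish, God willing)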

In [47]:
r = requests.get('https://www.doktorsitesi.com/dyt-dilara-yuksel/diyetisyen/istanbul')
soup = BeautifulSoup(r.text, 'html.parser')
ad = re.compile('.*18*.')
results = soup.find_all('h1', {'data-expert-id':ad})
doktor = [result.text for result in results]

regex = re.compile('.*op-message-item*.')
results = soup.find_all('p', {'class':regex})
yorumlar = [result.text for result in results]
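
In a pattern such as `'.*op-message-item*.'`, the `*` applies only to the preceding character, so `item*.` means `ite`, then zero or more `m`, then any one character. If the intent is to match any class containing `op-message-item` (an assumption), the plain pattern is enough when it is applied with `search`, which is how BeautifulSoup applies a compiled pattern to attribute values. The class names below are hypothetical.

```python
import re

original = re.compile('.*op-message-item*.')
simpler = re.compile('op-message-item')

# Hypothetical class names for illustration
classes = ["op-message-item", "op-message-item-text", "op-message-ite!", "other"]

for name in classes:
    print(name, bool(original.search(name)), bool(simpler.search(name)))
```

Only `op-message-ite!` is treated differently: the original pattern accepts it, the simpler one does not.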


In [48]:
statement = yorumlar[3]

In [54]:
sayi=0

In [55]:
for statement in yorumlar:
    sayi += 1
    # Encode the review with the tokenizer fitted on the corpus
    myTestEncoded = tokenizer.texts_to_sequences([statement])
    print(myTestEncoded)
    # Pad the encoded review to the length the models expect
    myTestPadded = pad_sequences(myTestEncoded, padding='post', truncating='post', maxlen=max_length)
    print(myTestPadded)
    print(sayi)
    print("yorum:", statement)
    # Predict once per model and reuse the score for both the label and the printed value
    for name, clf in (("Deep NN model ", model1),
                      ("Word Embedding ", model2),
                      ("Word Embedding + LSTM ", model3)):
        score = clf.predict(myTestPadded)[0][0]
        print(name, 'OLUMLU' if score > 0.5 else 'OLUMSUZ', score)


[[]]
[[0]]
1
yorum: 
Harika bir diyetisyen çok güzel enerjisi var. Kısıtlamadan kilo veriyorum destekleriyle 💕



Deep NN model  OLUMLU 0.5181714
Word Embedding  OLUMLU 0.50653595
Word Embedding + LSTM  OLUMSUZ 0.49821675
[[]]
[[0]]
2
yorum: 
Bayılıyorum enerjisine enerjiniz tükendiğinde size hiç kötü hissettirmeyip hatta daha güzel hissettiriyor iyi ki🫶💙



Deep NN model  OLUMLU 0.5181714
Word Embedding  OLUMLU 0.50653595
Word Embedding + LSTM  OLUMSUZ 0.49821675
[[]]
[[0]]
3
yorum: 
Anlayışlı , güler yüzlü , danışanlarını dinleyen , elinden tutan güzel bir insan kendisi .



Deep NN model  OLUMLU 0.5181714
Word Embedding  OLUMLU 0.50653595
Word Embedding + LSTM  OLUMSUZ 0.49821675
[[]]
[[0]]
4
yorum: 
Tek kelimeyle süper bir doktor   nazıkve açıklayıcı davranısı



Deep NN model  OLUMLU 0.5181714
Word Embedding  OLUMLU 0.50653595
Word Embedding + LSTM  OLUMSUZ 0.49821675
[[]]
[[0]]
5
yorum: 
Mükemmel bir deneyimdi. Açıklayıcı ve samimi bir doktor.



Deep NN model  OLUMLU 0.5181714
W

In [56]:
statement = "merhabalar iyi ki varsınız"
# Encode and pad the statement the same way as the training data
myTestEncoded = tokenizer.texts_to_sequences([statement])
print(myTestEncoded)
myTestPadded = pad_sequences(myTestEncoded, padding='post', truncating='post', maxlen=max_length)
print(myTestPadded)
print("yorum:", statement)
# Predict once per model and reuse the score for both the label and the printed value
for name, clf in (("Deep NN model ", model1),
                  ("Word Embedding ", model2),
                  ("Word Embedding + LSTM ", model3)):
    score = clf.predict(myTestPadded)[0][0]
    print(name, 'OLUMLU' if score > 0.5 else 'OLUMSUZ', score)


[[]]
[[0]]
yorum: merhabalar iyi ki varsınız
Deep NN model  OLUMLU 0.5181714
Word Embedding  OLUMLU 0.50653595
Word Embedding + LSTM  OLUMSUZ 0.49821675
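
Every statement above encodes to `[[]]` and pads to `[[0]]` because none of its words are in the fitted word index, so all three models receive the same input and return the same score each time. The Keras `Tokenizer` drops unknown words unless an `oov_token` is set. A plain-Python sketch of both behaviours, using the word index printed in the corpus summary; `encode` is a hypothetical helper and the out-of-vocabulary id 12 is an arbitrary choice for illustration.

```python
# Word index printed by the corpus summary earlier in this notebook
fitted_index = {'t': 1, 'o': 2, 'n': 3, 'u': 4, 'l': 5, 'c': 6,
                'e': 7, 'm': 8, 'x': 9, 's': 10, 'z': 11}

def encode(text, word_index, oov_id=None):
    """Map words to ids; unknown words are dropped unless oov_id is given."""
    ids = []
    for word in text.lower().split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_id is not None:
            ids.append(oov_id)
    return ids

statement = "merhabalar iyi ki varsınız"
print(encode(statement, fitted_index))             # []
print(encode(statement, fitted_index, oov_id=12))  # [12, 12, 12, 12]
```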


---
# Visualize the embedding

## Save the word vectors and words

In [57]:
e = model3.get_layer(name='embeded')
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size + 1, embedding_dim); row 0 is the padding id

(12, 8)


In [58]:
file_vec = 'vecs_' + str(epochs) + '.tsv'
file_meta = 'meta_' + str(epochs) + '.tsv'

# Write one embedding vector per line, and the matching word on the same line of the metadata file
with open(file_vec, 'w', encoding='utf-8') as out_v, open(file_meta, 'w', encoding='utf-8') as out_m:
  for word, idx in tokenizer.word_index.items():
    vec = weights[idx]  # word ids start at 1; row 0 is padding and is skipped
    out_m.write(word + "\n")
    out_v.write('\t'.join(str(x) for x in vec) + "\n")
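
The Embedding Projector expects the two files to line up row by row: line *i* of the metadata file labels line *i* of the vectors file. A small sketch that writes both to in-memory buffers and checks the alignment; the three words and their 2-dimensional vectors are hypothetical.

```python
import io

# Hypothetical 2-dimensional vectors for three words
vectors = {"olsun": [0.12, -0.40], "dert": [0.05, 0.33], "kör": [-0.27, 0.08]}

vec_buf, meta_buf = io.StringIO(), io.StringIO()
for word, vec in vectors.items():
    meta_buf.write(word + "\n")
    vec_buf.write("\t".join(str(x) for x in vec) + "\n")

vec_rows = vec_buf.getvalue().splitlines()
meta_rows = meta_buf.getvalue().splitlines()

# One label per vector, in the same order
print(len(vec_rows) == len(meta_rows))   # True
print(meta_rows[0], vec_rows[0].split("\t"))
```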

## Download 2 files

In [62]:
try:
  from google.colab import files
except ImportError:
   pass
else:
  files.download(file_vec)
  files.download(file_meta)



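## Open http://projector.tensorflow.org/

Click **Load** and upload the vectors file (`vecs_100.tsv`) and the metadata file (`meta_100.tsv`) saved above.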