<a href="https://colab.research.google.com/github/PeterPirog/cars-regression/blob/main/Text_Vectorization_Use_Save_Upload.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KERAS TEXT VECTORIZATION LAYER: USE, SAVE, AND UPLOAD

**Author:** [Murat Karakaya](https://www.linkedin.com/in/muratkarakaya/)<br>
**Date created:** 05 Oct 2021<br>
**Last modified:** 24 Oct 2021<br>
**Description:** This is a new part of the "**[tf.keras.layers: Understand & Use](https://www.youtube.com/playlist?list=PLQflnv_s49v_7WIgOo9mVKptLZHyOYysD)**" / "[**tf.keras.layers: Anla ve Kullan**](https://www.youtube.com/playlist?list=PLQflnv_s49v9h85zD1_GDfTxZOrCWTDhp)" series. In this part, we will build, adapt, use, save, and upload the Keras TextVectorization layer. 

We will download a [Kaggle Dataset](https://www.kaggle.com/savasy/multiclass-classification-data-for-turkish-tc32?select=ticaret-yorum.csv) in which there are 32 topics and more than 400K total reviews. 
In this tutorial, we will use this dataset for a multi-class text classification task.

Our **main aim** is to learn how to effectively use the Keras `TextVectorization` layer in practice.

The tutorial has 5 parts:

* **PART A: BACKGROUND**
* **PART B: KNOW THE DATA**
* **PART C: USE KERAS TEXT VECTORIZATION LAYER**
* **PART D: BUILD AN END-TO-END MODEL**
* **PART E: SUMMARY**


At the end of this tutorial, we will cover:
* What a Keras `TextVectorization` layer is
* Why we need to use a Keras `TextVectorization` layer in Natural Language Processing (NLP) tasks
* How to employ a Keras `TextVectorization` layer in **Text Preprocessing**
* How to integrate a Keras `TextVectorization` layer into a trained model
* How to save and upload a Keras `TextVectorization` layer and a model with a Keras `TextVectorization` layer
* How to integrate a Keras `TextVectorization` layer with **TensorFlow Data Pipeline** API (`tf.data`)
* How to design, train, save, and load an End-to-End model using Keras `TextVectorization` layer

**Accessible on:**
* [YouTube in English](https://youtube.com/playlist?list=PLQflnv_s49v8Eo2idw9Ju5Qq3JTEF-OFW)
* [YouTube in Turkish](https://youtube.com/playlist?list=PLQflnv_s49v8-xeTLx1QmuE-YkRB4bToF)
* [Medium](https://kmkarakaya.medium.com/text-vectorization-use-save-upload-54d65945d222)
* [Github pages](https://kmkarakaya.github.io/Deep-Learning-Tutorials/)
* [Github Repo](https://github.com/kmkarakaya/Deep-Learning-Tutorials)
* [Google Colab](https://colab.research.google.com/drive/1_hiUXcX6DwGEsPP2iE7i-HAs-5HqQrSe?usp=sharing)



# REFERENCES
* [Keras Preprocessing layers by Keras.io](https://keras.io/api/layers/preprocessing_layers/)
* [Text classification from scratch by Keras.io](https://keras.io/examples/nlp/text_classification_from_scratch/)
* [TextVectorization layer by Keras.io](https://keras.io/api/layers/preprocessing_layers/text/text_vectorization/)

# **PART A: BACKGROUND**

# 1 TERMS & CONCEPTS

## 1.1. What is Text Vectorization?

Text Vectorization is the process of converting text into a numerical representation.

There are many different techniques proposed to convert text to a numerical form such as:
* One-hot Encoding (OHE)
* Count Vectorizer
* Bag-of-Words (BOW)
* N-grams
* Term Frequency
* Term Frequency-Inverse Document Frequency (TF-IDF)
* Embeddings
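
To make two of these techniques concrete, here is a minimal plain-Python sketch (not part of the original notebook) of One-hot Encoding and a Bag-of-Words count vector. The toy corpus and the helper names `one_hot` and `bag_of_words` are hypothetical and only serve as an illustration.

```python
from collections import Counter

# Toy corpus (hypothetical); a real pipeline would first clean and tokenize the text.
corpus = ["the cat sat", "the cat saw the dog"]

# Vocabulary: every distinct word, sorted so that the indices are reproducible.
vocabulary = sorted({word for doc in corpus for word in doc.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """One-hot encoding: a vector with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def bag_of_words(doc):
    """Count vector: how many times each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


print(vocabulary)               # ['cat', 'dog', 'sat', 'saw', 'the']
print(one_hot("dog"))           # [0, 1, 0, 0, 0]
print(bag_of_words(corpus[1]))  # [1, 1, 0, 1, 2]
```

Note that the count vector discards word order, whereas a sequence of integer ids (used later in this tutorial) keeps it.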



## 1.2. What is Text Preprocessing?
Text preprocessing is traditionally an important step for natural language processing (NLP) tasks. It transforms text into a more suitable form so that Machine Learning or Deep Learning algorithms can perform better.

The main phases of text preprocessing are:
* **Noise Removal** (cleaning) – removing unnecessary characters and formatting
* **Tokenization** – breaking multi-word strings into smaller components
* **Normalization** – a catch-all term for processing data; this includes stemming and lemmatization


Some of the common **Noise Removal** (cleaning) steps are:

* Removal of Punctuations
* Removal of Frequent words
* Removal of Rare words
* Removal of emojis
* Removal of emoticons
* Conversion of emoticons to words
* Conversion of emojis to words
* Removal of URLs
* Removal of HTML tags
* Chat words conversion
* Spelling correction

**Tokenization** is about splitting strings of text into smaller pieces, or “tokens”. Paragraphs can be tokenized into sentences and sentences can be tokenized into words. 


**Noise Removal** and **Tokenization** are staples of almost all text preprocessing pipelines. However, some data may require further processing through text **normalization**. Some of the common **normalization** steps are:
* Upper or lowercasing
* Stopword removal
* Stemming – bluntly removing prefixes and suffixes from a word
* Lemmatization – replacing a single-word token with its root
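
The three phases can be sketched in plain Python as follows. This is only an illustration under simple assumptions: the tiny stop word set, the sample sentence, and the function names are hypothetical, and stemming and lemmatization are left out.

```python
import re
import string

STOP_WORDS = {"a", "the", "is", "are"}  # tiny illustrative list


def remove_noise(text):
    """Noise removal: drop HTML tags and URLs, then replace punctuation with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # URLs (before punctuation is stripped)
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def tokenize(text):
    """Tokenization: split the cleaned string on whitespace."""
    return text.split()


def normalize(tokens):
    """Normalization: lowercase every token and remove stop words."""
    return [token.lower() for token in tokens if token.lower() not in STOP_WORDS]


raw = "The <b>delivery</b> is LATE! See https://example.com/track"
tokens = normalize(tokenize(remove_noise(raw)))
print(tokens)  # ['delivery', 'late', 'see']
```

The order of the steps matters: the URL is removed before punctuation, because stripping punctuation first would break the URL into ordinary-looking words.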



## 1.3. What is Keras Text Vectorization layer?
 

`tf.keras.layers.TextVectorization` layer is one of the [Keras Preprocessing layers](https://keras.io/guides/preprocessing_layers/). 

We can preprocess the input by using different libraries, such as Python's built-in string methods or scikit-learn.

However, there are very important advantages to using the [Keras Preprocessing layers](https://keras.io/guides/preprocessing_layers/):

* You can build **Keras-native** input processing **pipelines**. These input processing pipelines can be used as **independent** preprocessing code in **non-Keras workflows**, combined directly with Keras models, and exported as part of a Keras SavedModel.

* You can build and **export** models that are **truly end-to-end**: models that accept **raw data** (images or raw structured data) as input; models that handle feature **normalization** or feature value **indexing** on their own.

Today, we will deal with the `tf.keras.layers.TextVectorization` layer, which turns ***raw strings*** into an **encoded representation** that can be read by an `Embedding` layer or a `Dense` layer.

That is, the `tf.keras.layers.TextVectorization` layer can be used in 
* **Text Preprocessing** and
* **Text Vectorization**

# 2. IMPORT LIBRARIES

**IMPORTANT:** When I prepared this tutorial on 05 Oct 2021, the then-current version (2.6.0) of TF and Keras generated some **errors** when saving and uploading the **tf.keras.layers.TextVectorization layer**.

However, the nightly version has no problem handling these operations.

For more information about the bug, please see [here](https://github.com/keras-team/keras/issues/15443#issuecomment-938211510)



```python
import tensorflow as tf

from tensorflow import keras

print("tf version:",tf.__version__)

print("keras version:", keras.__version__)

# Output:
# tf version: 2.6.0
# keras version: 2.6.0
```

Therefore, below I first install the TF nightly version, which reported the following versions:

```python
# tf version: 2.8.0-dev20211005
# keras version: 2.7.0
```

In [None]:
pip install tf-nightly --quiet --upgrade

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import os

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import re
import string
import random
from sklearn.model_selection import train_test_split

In [None]:
print("tf version:",tf.__version__)
print("keras version:", keras.__version__)

tf version: 2.8.0-dev20211203
keras version: 2.8.0


In [None]:
#@title Record Each Cell's Execution Time
!pip install ipython-autotime

%load_ext autotime

Collecting ipython-autotime
  Downloading ipython_autotime-0.3.1-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.3.1
time: 204 µs (started: 2021-12-03 14:12:16 +00:00)


# 3. DOWNLOAD A KAGGLE DATASET INTO GOOGLE COLAB

The [Multi Class Classification Dataset for Turkish](https://www.kaggle.com/savasy/multiclass-classification-data-for-turkish-tc32?select=ticaret-yorum.csv) is a **benchmark dataset for Turkish** **text classification** task. 

It contains 430K comments/reviews across a total of 32 categories of products or services.

Each category has roughly 13K comments.

A baseline algorithm, Naive Bayes, achieves an F1 score of 84%.




[My blog post explaining how to download Kaggle Datasets is here.](https://medium.com/analytics-vidhya/how-to-fetch-kaggle-datasets-into-google-colab-ea682569851a)

My video tutorial explaining how to download Kaggle Datasets is here: [Turkish](https://youtu.be/ls47CPFU1vE)/[English](https://youtu.be/_rlt4mzLDLc)



In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
time: 1.66 ms (started: 2021-12-03 14:12:16 +00:00)


In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/Colab Notebooks/input"

time: 850 µs (started: 2021-12-03 14:12:16 +00:00)


In [None]:
#changing the working directory
%cd "/content/gdrive/MyDrive/Colab Notebooks/input"


/content/gdrive/MyDrive/Colab Notebooks/input
time: 3.1 ms (started: 2021-12-03 14:12:16 +00:00)


In [None]:
#get the api command from kaggle dataset page
#!kaggle datasets download -d savasy/multiclass-classification-data-for-turkish-tc32

time: 570 µs (started: 2021-12-03 14:12:16 +00:00)


In [None]:
# check the downloaded zip file
!ls 

120001_PH1.csv	generatedReviews.csv	    kaggle.json        tr_stop_word.txt
320d.csv	generatedReviews_final.csv  model.png	       vocabPickle
corona.csv	generatedReviews_plus.csv   ticaret-yorum.csv
time: 152 ms (started: 2021-12-03 14:12:16 +00:00)


In [None]:
# unzipping the zip files and deleting the zip files
!unzip \*.zip  && rm *.zip

unzip:  cannot find or open *.zip, *.zip.zip or *.zip.ZIP.

No zipfiles found.
time: 139 ms (started: 2021-12-03 14:12:16 +00:00)


In [None]:
# check the downloaded csv file
!ls 

120001_PH1.csv	generatedReviews.csv	    kaggle.json        tr_stop_word.txt
320d.csv	generatedReviews_final.csv  model.png	       vocabPickle
corona.csv	generatedReviews_plus.csv   ticaret-yorum.csv
time: 136 ms (started: 2021-12-03 14:12:16 +00:00)


# 4. LOAD STOP WORDS IN TURKISH

As you might know, "**Stop words**" are a set of commonly used words in a language. Examples of stop words in **English** are “a”, “the”, “is”, “are”, and so on. Stop word lists are commonly used in Text Mining and Natural Language Processing (NLP) to **eliminate** words that are so common that they carry **very little useful information**.

I begin by loading an existing list of Turkish stop words below:

In [None]:
tr_stop_words = pd.read_csv('tr_stop_word.txt',header=None)
for each in tr_stop_words.values[:5]:
  print(each[0])

ama
amma
anca
ancak
bu
time: 13.2 ms (started: 2021-12-03 14:12:17 +00:00)


# 5. LOAD THE DATASET
After downloading the dataset from the Kaggle website, we can load it by using the pandas `read_csv()` function:

In [None]:
data = pd.read_csv('ticaret-yorum.csv')
pd.set_option('max_colwidth', 400)

time: 5.34 s (started: 2021-12-03 14:12:17 +00:00)


# **PART B: KNOW THE DATA**

# 6. EXPLORE THE DATASET

Before getting into the details of how to use the `tf.keras.layers.TextVectorization` layer, let me introduce the dataset briefly.

## Shuffle Data

It is a really good and useful habit to shuffle the data as the first step of preprocessing, before doing anything else!

Actually, I will also shuffle the data at the last step of the pipeline.
But it does not hurt to shuffle it twice :))


In [None]:
data= data.sample(frac=1)

time: 98.4 ms (started: 2021-12-03 14:12:22 +00:00)


## Summary Information about the dataset

Get the initial information about the dataset:

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 431306 entries, 71903 to 142963
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   category  431306 non-null  object
 1   text      431306 non-null  object
dtypes: object(2)
memory usage: 9.9+ MB
time: 107 ms (started: 2021-12-03 14:12:22 +00:00)


We have a total of **431306** rows and **2** columns: ***category*** & ***text***.

According to `data.info()`, there are **no null values** in the dataset. If there were any, we could drop them and verify the result as follows:
```python
data.dropna(inplace=True)

data.isnull().sum()
```

## Sample Reviews and their categories:

In [None]:
data.head()

Unnamed: 0,category,text
71903,egitim,"Benim Hocam Yayınevi 'den Yanlış Beyan,Benim Hocam Tarih soru bankasının kapağında tamamı çözümlü yazdığı için aldım ama 4 test kadar ilerleyince fark ettim ki tamamı çözümlü değil her konunun son testleri çözümsüz bence bu yapılan ‘beyanda sahtecilik’ hiç yakıştıramadım bu yayınevine başka bir kaynağını satın almayı düşünmüyorum.Devamını oku"
260078,kucuk-ev-aletleri,"ECA Petek Ses Yapıyor,Geçen yıl yeni yaptırmış olduğumuz doğalgaz tesisatında ECA marka petekler kullanıldı. Peteklerden birinden sürekli tak tuk sesler geliyor. İlgili servis önce ses duymadığını sonra da sorunu ECA merkeze ilettiklerini bir sonuç alınamadığını söyledi. Yapıldığı günden beri bu şekilde arızalı bir ürünü...Devamını oku"
46503,bilgisayar,"TP-Link W9970V3 Modem Isınma Ve Sararma!,""TP-Link TD-W9970v3 modelini 1 yıldır kullanıyorum."
263875,kucuk-ev-aletleri,"Philips Ütü Patladı Patlayacak!,""Görselde ki ütüyü 5-6 ay önce, Beylikdüzü 5m Migros Philips bayiinden aldık. Aldığımızdan beri sürekli sorun çıkartıyor, aldığımız yere götürdük fakat bir problem olmadığını iddia ediyorlar."
109121,enerji,"İgdaş Fatura İtirazı Sonucu!,""Faturam 164 TL geldi ocak ayında bile bu kadar fatura gelmedi. Araştırılıp para iademi istiyorum. Corona'dan dolayı çalışamıyoruz bir de sürekli mesaj atıp atıp duruyorsunuz gecikme zammı alınacak diye."


time: 19.1 ms (started: 2021-12-03 14:12:22 +00:00)


# 7. CREATE A TENSORFLOW DATA PIPELINE FOR TEXT PREPROCESSING &  VECTORIZATION

So far, we have only observed some properties of the **raw data**.
Using these observations, we are ready to preprocess the `text` data for a classifier model.

Below, we will begin to create a **TensorFlow data pipeline** which includes **Keras Text Vectorization layer** for preprocessing the data and preparing it for a classifier.

A pipeline for a text model mostly involves extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths.

In this tutorial, I will use the TensorFlow "**tf.data**" API. If you are not familiar with the TF data pipeline "**tf.data**" API, you can consult the resources below:
* Official TensorFlow blog: [tf.data: Build TensorFlow input pipelines](https://www.tensorflow.org/guide/data) 
* The Murat Karakaya Akademi YouTube playlist in Turkish: [tf.data: TensorFlow Data Pipeline Anlamak ve Kullanmak](https://www.youtube.com/playlist?list=PLQflnv_s49v8l8dYU01150vcoAn4sWSAm)  
* The Murat Karakaya Akademi YouTube playlist in English: [TensorFlow Data Pipeline: How to Design Code Use TensorFlow Data Pipelines with Python & Keras](https://www.youtube.com/playlist?list=PLQflnv_s49v_m6KLMsORgs9hVIvDCwDAb)
* The Murat Karakaya Akademi Medium blog: [tf.data: Tensorflow Data Pipelines](https://medium.com/deep-learning-with-keras/tf-data-tensorflow-data-pipelines-71915155bdf2)


## Convert Categories From Strings to Integer Ids

Observe that the categories (topics/classes) of the reviews are **strings**:

In [None]:
data["category"]

71903                        egitim
260078            kucuk-ev-aletleri
46503                    bilgisayar
263875            kucuk-ev-aletleri
109121                       enerji
                    ...            
164858                        giyim
419530                       ulasim
247721    kisisel-bakim-ve-kozmetik
163826                        giyim
142963                       finans
Name: category, Length: 431306, dtype: object

time: 8.82 ms (started: 2021-12-03 14:12:22 +00:00)


We need to create **integer** category **ids** from the **string** category **names** by adding a new column, "**category_id**", to the dataframe:

In [None]:
data["category"] = data["category"].astype('category')
data["category_id"] = data["category"].cat.codes
data.head()

Unnamed: 0,category,text,category_id
71903,egitim,"Benim Hocam Yayınevi 'den Yanlış Beyan,Benim Hocam Tarih soru bankasının kapağında tamamı çözümlü yazdığı için aldım ama 4 test kadar ilerleyince fark ettim ki tamamı çözümlü değil her konunun son testleri çözümsüz bence bu yapılan ‘beyanda sahtecilik’ hiç yakıştıramadım bu yayınevine başka bir kaynağını satın almayı düşünmüyorum.Devamını oku",5
260078,kucuk-ev-aletleri,"ECA Petek Ses Yapıyor,Geçen yıl yeni yaptırmış olduğumuz doğalgaz tesisatında ECA marka petekler kullanıldı. Peteklerden birinden sürekli tak tuk sesler geliyor. İlgili servis önce ses duymadığını sonra da sorunu ECA merkeze ilettiklerini bir sonuç alınamadığını söyledi. Yapıldığı günden beri bu şekilde arızalı bir ürünü...Devamını oku",19
46503,bilgisayar,"TP-Link W9970V3 Modem Isınma Ve Sararma!,""TP-Link TD-W9970v3 modelini 1 yıldır kullanıyorum.",3
263875,kucuk-ev-aletleri,"Philips Ütü Patladı Patlayacak!,""Görselde ki ütüyü 5-6 ay önce, Beylikdüzü 5m Migros Philips bayiinden aldık. Aldığımızdan beri sürekli sorun çıkartıyor, aldığımız yere götürdük fakat bir problem olmadığını iddia ediyorlar.",19
109121,enerji,"İgdaş Fatura İtirazı Sonucu!,""Faturam 164 TL geldi ocak ayında bile bu kadar fatura gelmedi. Araştırılıp para iademi istiyorum. Corona'dan dolayı çalışamıyoruz bir de sürekli mesaj atıp atıp duruyorsunuz gecikme zammı alınacak diye.",8


time: 73.5 ms (started: 2021-12-03 14:12:22 +00:00)
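
What the `category` dtype and `.cat.codes` do here can be mimicked in plain Python. The sketch below is only an illustration and not the pandas implementation: it assumes that ids follow the alphabetical order of the category names (which matches the ids shown in this tutorial), and the variable names are hypothetical.

```python
# A few category names taken from the sample rows above.
categories = ["egitim", "kucuk-ev-aletleri", "bilgisayar", "kucuk-ev-aletleri", "enerji"]

# Sort the distinct names and number them: the position becomes the integer id.
labels = sorted(set(categories))
category_to_id = {label: i for i, label in enumerate(labels)}

# Encode every row, and build the reverse lookup for decoding predictions later.
ids = [category_to_id[name] for name in categories]
id_to_name = {i: label for label, i in category_to_id.items()}

print(labels)      # ['bilgisayar', 'egitim', 'enerji', 'kucuk-ev-aletleri']
print(ids)         # [1, 3, 0, 3, 2]
print(id_to_name)  # {0: 'bilgisayar', 1: 'egitim', 2: 'enerji', 3: 'kucuk-ev-aletleri'}
```

The ids differ from the ones in the dataframe only because this toy list contains 4 of the 32 categories.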


Lastly, we can check the number of categories. Note that it should be **32**: 

In [None]:
data['category']

71903                        egitim
260078            kucuk-ev-aletleri
46503                    bilgisayar
263875            kucuk-ev-aletleri
109121                       enerji
                    ...            
164858                        giyim
419530                       ulasim
247721    kisisel-bakim-ve-kozmetik
163826                        giyim
142963                       finans
Name: category, Length: 431306, dtype: category
Categories (32, object): ['alisveris', 'anne-bebek', 'beyaz-esya', 'bilgisayar', ..., 'spor',
                          'temizlik', 'turizm', 'ulasim']

time: 9.33 ms (started: 2021-12-03 14:12:22 +00:00)


## Build a Dictionary for id to text category (topic) look-up:

In [None]:
id_to_category = pd.Series(data.category.values,index=data.category_id).to_dict()
id_to_category

{0: 'alisveris',
 1: 'anne-bebek',
 2: 'beyaz-esya',
 3: 'bilgisayar',
 4: 'cep-telefon-kategori',
 5: 'egitim',
 6: 'elektronik',
 7: 'emlak-ve-insaat',
 8: 'enerji',
 9: 'etkinlik-ve-organizasyon',
 10: 'finans',
 11: 'gida',
 12: 'giyim',
 13: 'hizmet-sektoru',
 14: 'icecek',
 15: 'internet',
 16: 'kamu-hizmetleri',
 17: 'kargo-nakliyat',
 18: 'kisisel-bakim-ve-kozmetik',
 19: 'kucuk-ev-aletleri',
 20: 'medya',
 21: 'mekan-ve-eglence',
 22: 'mobilya-ev-tekstili',
 23: 'mucevher-saat-gozluk',
 24: 'mutfak-arac-gerec',
 25: 'otomotiv',
 26: 'saglik',
 27: 'sigortacilik',
 28: 'spor',
 29: 'temizlik',
 30: 'turizm',
 31: 'ulasim'}

time: 82 ms (started: 2021-12-03 14:12:22 +00:00)


In [None]:
pwd

'/content/gdrive/My Drive/Colab Notebooks/input'

time: 5.12 ms (started: 2021-12-03 14:30:38 +00:00)


In [None]:
import pickle

# save the id-to-category dictionary so that it can be reused later
with open("id_to_category.pkl", "wb") as pkl_file:
    pickle.dump(id_to_category, pkl_file)

# load it back to verify the round trip
with open("id_to_category.pkl", "rb") as pkl_file:
    uploaded_id_to_category = pickle.load(pkl_file)
print(uploaded_id_to_category)

{5: 'egitim', 19: 'kucuk-ev-aletleri', 3: 'bilgisayar', 8: 'enerji', 28: 'spor', 1: 'anne-bebek', 31: 'ulasim', 2: 'beyaz-esya', 21: 'mekan-ve-eglence', 0: 'alisveris', 17: 'kargo-nakliyat', 11: 'gida', 30: 'turizm', 9: 'etkinlik-ve-organizasyon', 25: 'otomotiv', 7: 'emlak-ve-insaat', 16: 'kamu-hizmetleri', 15: 'internet', 13: 'hizmet-sektoru', 27: 'sigortacilik', 20: 'medya', 29: 'temizlik', 12: 'giyim', 6: 'elektronik', 18: 'kisisel-bakim-ve-kozmetik', 24: 'mutfak-arac-gerec', 4: 'cep-telefon-kategori', 26: 'saglik', 14: 'icecek', 10: 'finans', 23: 'mucevher-saat-gozluk', 22: 'mobilya-ev-tekstili'}
time: 8.92 ms (started: 2021-12-03 14:32:12 +00:00)


## Reduce the Size of the Dataset

Since using a large dataset for **testing** your pipeline would take more time, you may prefer to **take a portion** of the raw dataset, as in the (commented-out) cell below:

In [None]:
#limit the number of samples to be used in testing the pipeline
#data_size= 1000 #instead of 431306 
#data= data[:data_size]
#data.info()

time: 1.55 ms (started: 2021-10-08 14:40:16 +00:00)


## Split the Raw Dataset into Train and Test Datasets

To prevent **data leakage** while preprocessing the text data, we need to split the text into Train and Test datasets. 

**Data leakage** refers to a mistake made by the creator of a machine learning model in which they accidentally share information between the test and training datasets. Typically, when splitting a dataset into testing and training sets, the goal is to ensure that no data is shared between the two. This is because the test set’s purpose is to simulate real-world, unseen data. However, when evaluating a model, we do have full access to both our train and test sets, so it is up to us to ensure that no data in the training set is present in the test set.

In our case, since we want to classify reviews, we must **not use** the test reviews in **text vectorization**.

In [None]:
# save features and targets from the 'data'
features, targets = data['text'], data['category_id']

train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        random_state=42,
        shuffle = True,
        stratify=targets
    )

time: 286 ms (started: 2021-10-08 14:40:16 +00:00)
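
To see what `stratify=targets` buys us, here is a plain-Python sketch of a stratified split. It is not scikit-learn's implementation; the function `stratified_split` and the toy data are hypothetical. The idea is to cut every class separately, so that each class keeps about the same share in the train and test parts.

```python
import random
from collections import Counter, defaultdict


def stratified_split(features, targets, train_size=0.8, seed=42):
    """Split so that every class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, target in enumerate(targets):
        indices_by_class[target].append(index)

    train_idx, test_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)                     # shuffle within the class
        cut = round(len(indices) * train_size)   # per-class cut point
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])

    rng.shuffle(train_idx)                       # mix the classes again
    rng.shuffle(test_idx)

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))


# Toy data: 10 reviews for each of two hypothetical class ids.
toy_features = [f"review {i}" for i in range(20)]
toy_targets = [0] * 10 + [1] * 10
X_train, X_test, y_train, y_test = stratified_split(toy_features, toy_targets)
print(Counter(y_train), Counter(y_test))
```

With 10 samples per class and `train_size=0.8`, each class contributes 8 samples to the train part and 2 to the test part, and no review appears in both.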


## Build the Train & Test TensorFlow Datasets

First, we create **TensorFlow Datasets** from the raw Train Dataframe for further processing.

Note that:
1. **X**: input (text/reviews)
2. **y**: target value (categories/topics/class)

**Observe that** we have **reviews in text** as input and **categories (topics) in integer** as target values:

In [None]:
train_features.values[:5]

array(['İçim Kaşar Peynir İçinden Yeşil Madde,Kaşar peynirin içinden maydanoza benzer yeşil bir madde çıktı biz bunu fark etmeden yiyebiliriz de lütfen yetkililerden bir açıklama bekliyorum bu gıda maddesinin içinde ne gibi bir madde olabilir. Bize nasıl ortamlarda ürettiğiniz ürünleri yediriyorsunuz kesinlikle küf değil fotoğrafını da ekliyoru...Devamını oku',
       'Philips TV İnternet Bağlantı Sorunu!,"Philips 32PFS5803/62 model Smart televizyonumu Vatan markete henüz 1 ay oldu alalı 1 ay her yere bağlanan TV internete bağlı olmasına rağmen YouTube.com, Smart TV, uygulama galerisi vb... Hiçbir uygulamayı açmıyor. Girmeye çalıştığım zaman ""bu TV\'yi internete bağlayın"" sayfası açılıyor ve bağlamaya ...Devamını oku"',
       'Anadolu Hastanesi (Çanakkale) Muayene Süresi Kısalığı,20 aylık çocuğum var devamlı çocuk Dr. y. A muayene oluyorum ama artık aynı sorunla karşılaşmaktan bıktım. Alel acele 5 dakikada muayene yapıor hastanın çıkmasını beklemeden yeni hasta alıyor ve onun yanınd

time: 7.27 ms (started: 2021-10-08 14:40:16 +00:00)


In [None]:
train_targets.values[:5]

array([11,  6, 26, 20, 19], dtype=int8)

time: 5.68 ms (started: 2021-10-08 14:40:16 +00:00)


## Prepare TensorFlow Datasets

We convert the data stored in the pandas DataFrames into TensorFlow Datasets as below:

In [None]:
# train X & y
train_text_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(train_features.values, tf.string)
) 
train_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(train_targets.values, tf.int64),

) 
# test X & y
test_text_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(test_features.values, tf.string)
) 
test_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(test_targets.values, tf.int64),

) 

time: 1.81 s (started: 2021-10-08 14:40:16 +00:00)


## Decide the dictionary size and the review size

For preprocessing the text, we need to decide the **dictionary (vocabulary) size** and the **review (text) length**.


In [None]:
vocab_size = 20000  # Only consider the top 20K words
max_len = 50  # Maximum review (text) size in words
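
The effect of these two numbers can be sketched in plain Python. This is only an illustration with hypothetical helper names (`build_index`, `encode`) and toy reviews; it assumes that id 0 is reserved for padding and id 1 for out-of-vocabulary words.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved ids (assumption for this sketch)


def build_index(token_lists, vocab_size):
    """Keep only the `vocab_size` most frequent words; ids start after the reserved ones."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(vocab_size))}


def encode(tokens, index, max_len):
    """Map words to ids, truncate to `max_len`, then pad with PAD_ID up to `max_len`."""
    ids = [index.get(token, OOV_ID) for token in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


reviews = [
    ["urun", "gec", "geldi"],
    ["urun", "gec"],
    ["urun", "bozuk", "iade", "istiyorum"],
]
index = build_index(reviews, vocab_size=2)
print(index)                                  # {'urun': 2, 'gec': 3}
print(encode(reviews[0], index, max_len=4))   # [2, 3, 1, 0]
print(encode(reviews[2], index, max_len=3))   # [2, 1, 1]
```

A smaller vocabulary sends more words to the out-of-vocabulary id, and a smaller review length cuts off more of the long reviews, so both values trade information for a smaller model input.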

# **PART C: USE KERAS TEXT VECTORIZATION LAYER**

# 8. PREPROCESS THE TEXT WITH THE KERAS `TEXTVECTORIZATION` LAYER



## 8.1. Define your own `custom_standardization` function
First, I define a function that will preprocess the given text.
The `custom_standardization` function converts the given string to a standard form by applying several transformations:
* convert all characters to lowercase
* remove special symbols, extra spaces, HTML tags, digits, and punctuation
* remove stop words
* replace the special Turkish letters with the corresponding English letters.

In [None]:
@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """ Lowercase the text, strip noise and stop words, and map Turkish letters to English ones """
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    # raw strings keep the regex backslashes intact
    no_stars = tf.strings.regex_replace(no_uppercased, r"\*", " ")
    # remove the trailing "devamını oku" ("read more") marker
    no_repeats = tf.strings.regex_replace(no_stars, "devamını oku", "")
    no_html = tf.strings.regex_replace(no_repeats, "<br />", "")
    # remove every word that contains a digit
    no_digits = tf.strings.regex_replace(no_html, r"\w*\d\w*", "")
    no_punctuations = tf.strings.regex_replace(no_digits, f"([{string.punctuation}])", r" ")
    # remove stop words; the added spaces let the first and last words match as well
    no_stop_words = ' ' + no_punctuations + ' '
    for each in tr_stop_words.values:
        no_stop_words = tf.strings.regex_replace(no_stop_words, ' ' + each[0] + ' ', r" ")
    no_extra_space = tf.strings.regex_replace(no_stop_words, " +", " ")
    # replace the special Turkish letters with the corresponding English letters
    no_I = tf.strings.regex_replace(no_extra_space, "ı", "i")
    no_O = tf.strings.regex_replace(no_I, "ö", "o")
    no_C = tf.strings.regex_replace(no_O, "ç", "c")
    no_S = tf.strings.regex_replace(no_C, "ş", "s")
    no_G = tf.strings.regex_replace(no_S, "ğ", "g")
    no_U = tf.strings.regex_replace(no_G, "ü", "u")

    return no_U

Let's quickly verify that `custom_standardization` works by trying it on a sample Turkish input:

In [None]:
input_string = "Bu Issız Öğlenleyin de;  şunu ***1 Pijamalı Hasta***, ve  Ancak İşte Yağız Şoföre Çabucak Güvendi...Devamını oku"
print("input:  ", input_string)
output_string= custom_standardization(input_string)
print("output: ", output_string.numpy().decode("utf-8"))

input:   Bu Issız Öğlenleyin de;  şunu ***1 Pijamalı Hasta***, ve  Ancak İşte Yağız Şoföre Çabucak Güvendi...Devamını oku
output:   issiz oglenleyin pijamali hasta i̇ste yagiz sofore cabucak guvendi 
time: 58.8 ms (started: 2021-10-08 14:40:18 +00:00)


## 8.2. Configure the Keras `TextVectorization` layer

To preprocess the text, I will use the Keras `TextVectorization` layer. 

```python
tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    ngrams=None,
    output_mode="int",
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    **kwargs
)
```

The Keras `TextVectorization` layer processes each example in the dataset as follows:

1. Standardize each example (usually lowercasing + punctuation stripping)

2. Split each example into substrings (usually words)

3. Recombine substrings into tokens (usually ngrams)

4. Index tokens (associate a unique int value with each token)

5. Transform each example using this index, either into a vector of ints or a dense float vector.
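
The five steps above can be sketched in plain Python. This is a simplified illustration and not the Keras implementation: all function names and the two toy documents are hypothetical, and ids are assigned in order of first appearance rather than by frequency.

```python
import string


def standardize(text):
    """Step 1: lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def split(text):
    """Step 2: split on whitespace."""
    return text.split()


def add_bigrams(words):
    """Step 3: recombine neighbouring words into 2-grams, kept alongside the single words."""
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def build_index(token_lists):
    """Step 4: give each token a unique integer (0 and 1 are left for padding and OOV)."""
    index = {}
    for tokens in token_lists:
        for token in tokens:
            index.setdefault(token, len(index) + 2)
    return index


def transform(tokens, index):
    """Step 5: map tokens to their integer ids (1 for unknown tokens)."""
    return [index.get(token, 1) for token in tokens]


docs = ["Good product!", "Good service."]
token_lists = [add_bigrams(split(standardize(doc))) for doc in docs]
index = build_index(token_lists)
print(token_lists[0])                    # ['good', 'product', 'good product']
print(transform(token_lists[1], index))  # [2, 5, 6]
```

Steps 1 to 3 only reshape the text; step 4 is the part that has to be learned from data, which is why the layer needs to be adapted before it can be used.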




Let's build our `TextVectorization` layer by providing:

1. The `custom_standardization()` function as the `standardize` argument (a callable).
2. The vocabulary size as the `max_tokens` argument: `max_tokens` is the maximum size of the vocabulary that will be created from the dataset. If `None`, there is no cap on the size of the vocabulary. Note that this vocabulary contains 1 **OOV (Out Of Vocabulary)** token, so the effective number of tokens is `(max_tokens - 1 - (1 if output_mode == "int" else 0))`. This is why the code below passes `vocab_size+2`: after the two reserved entries, room is left for `vocab_size` words.
3. The `int` keyword as the `output_mode`: an optional specification for the **output** of the layer. Values can be "**int**", "**multi_hot**", "**count**", or "**tf_idf**", which configure the layer as follows:

   * "**int**": Outputs integer indices, one integer index per split string token. When output_mode == "int", 0 is reserved for masked locations; this reduces the vocab size to max_tokens - 2 instead of max_tokens - 1.

   * "**multi_hot**": Outputs a single int array per sample (batch item), of either vocab_size or max_tokens size, containing 1s in all elements where the token mapped to that index exists at least once in the batch item. 

   * "**count**": Like "multi_hot", but the int array contains a count of the number of times the token at that index appeared in the batch item. 

   * "**tf_idf**": Like "multi_hot", but the TF-IDF algorithm is applied to find the value in each token slot. 

   For "**int**" output, any shape of input and output is supported. For **all other output modes**, currently only **rank 1 inputs** (and rank 2 outputs after splitting) are supported. 

4. The `max_len` as the `output_sequence_length`: in "int" mode, each output sequence is padded or truncated to exactly `max_len` values.
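
The difference between the output modes, and the arithmetic behind the effective vocabulary size, can be illustrated with a plain-Python sketch. It is not the Keras implementation: the helper names and the toy vocabulary are hypothetical, it assumes that only the "int" mode reserves an extra slot for padding, and "tf_idf" is left out.

```python
def effective_vocab_size(max_tokens, output_mode):
    """Number of real words left once the reserved tokens are subtracted."""
    return max_tokens - 1 - (1 if output_mode == "int" else 0)


def encode(tokens, words, output_mode):
    """Encode one tokenized sample in "int", "count" or "multi_hot" mode."""
    # Assumption: "int" mode reserves '' (padding) and '[UNK]'; other modes only '[UNK]'.
    reserved = ["", "[UNK]"] if output_mode == "int" else ["[UNK]"]
    vocabulary = reserved + words
    oov_id = vocabulary.index("[UNK]")
    ids = [vocabulary.index(token) if token in words else oov_id for token in tokens]
    if output_mode == "int":
        return ids                           # one id per token, order preserved
    counts = [0] * len(vocabulary)
    for token_id in ids:
        counts[token_id] += 1
    if output_mode == "count":
        return counts                        # one slot per vocabulary entry
    if output_mode == "multi_hot":
        return [min(count, 1) for count in counts]
    raise ValueError(f"unsupported output_mode: {output_mode}")


words = ["urun", "siparis"]          # toy vocabulary
sample = ["urun", "urun", "kargo"]   # "kargo" is out of vocabulary

print(effective_vocab_size(20002, "int"))   # 20000
print(encode(sample, words, "int"))         # [2, 2, 1]
print(encode(sample, words, "count"))       # [1, 2, 0]
print(encode(sample, words, "multi_hot"))   # [1, 1, 0]
```

Only the "int" output keeps the word order, which is what an `Embedding` layer followed by a sequence model needs; the other modes produce one fixed-size vector per sample.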

In [None]:
# Create a vectorization layer (it is adapted to the text in the next step)
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size+2,
    output_mode="int",
    output_sequence_length=max_len,
)

## 8.3. Adapt the Keras `TextVectorization` layer with the **training** data set, (not test data set!) 

`TextVectorization` preprocessing layer has an internal state that can be computed based on a sample of the training data. That is, `TextVectorization` holds a **mapping** between **string** tokens and integer **indices**.

Thus, we will ***adapt*** the `TextVectorization` preprocessing layer **ONLY** to the **training** data.


**Please note that:** To prevent any data leak, we **DO NOT** adapt the `TextVectorization` preprocessing layer to the **whole** (***train & test***) data.

In [None]:
vectorize_layer.adapt(train_features)
vocab = vectorize_layer.get_vocabulary()  # To get words back from token indices

time: 2min 22s (started: 2021-10-08 14:40:18 +00:00)
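
Why adapting on the training data only matters can be shown with a plain-Python stand-in for the layer. The functions `fit_vocabulary` and `vectorize` and the toy reviews are hypothetical; the sketch assumes that index 0 is the padding entry and index 1 is `[UNK]`.

```python
train_reviews = ["urun gec geldi", "urun bozuk cikti"]
test_reviews = ["kargo gec geldi"]


def fit_vocabulary(texts):
    """Collect the words of the given texts after the two reserved entries."""
    vocabulary = ["", "[UNK]"]
    for text in texts:
        for word in text.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def vectorize(text, vocabulary):
    """Replace each word by its index; unseen words map to 1 ([UNK])."""
    return [vocabulary.index(word) if word in vocabulary else 1 for word in text.split()]


# The vocabulary is fitted on the training split only.
vocabulary = fit_vocabulary(train_reviews)
print(vocabulary)                              # ['', '[UNK]', 'urun', 'gec', 'geldi', 'bozuk', 'cikti']
print(vectorize(test_reviews[0], vocabulary))  # [1, 3, 4]
```

The test-only word "kargo" is encoded as `[UNK]`, exactly as a new word would be after deployment. Fitting the vocabulary on the whole data would hide this effect and make the test results look better than they should.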


Let's see some example conversions:

In [None]:
print("vocab has the ", len(vocab)," entries")
print("vocab has the following first 10 entries")
for word in range(10):
  print(word, " represents the word: ", vocab[word])

for X in train_features[:2]:
  print(" Given raw data: " )
  print(X)
  tokenized = vectorize_layer(tf.expand_dims(X, -1))
  print(" Tokenized and Transformed to a vector of integers: " )
  print (tokenized)
  print(" Text after Tokenized and Transformed: ")
  transformed = ""
  for each in tf.squeeze(tokenized):
    transformed= transformed+ " "+ vocab[each]
  print(transformed)

vocab has the  20002  entries
vocab has the following first 10 entries
0  represents the word:  
1  represents the word:  [UNK]
2  represents the word:  ne
3  represents the word:  tl
4  represents the word:  gun
5  represents the word:  urun
6  represents the word:  aldim
7  represents the word:  siparis
8  represents the word:  musteri
9  represents the word:  tarihinde
 Given raw data: 
İçim Kaşar Peynir İçinden Yeşil Madde,Kaşar peynirin içinden maydanoza benzer yeşil bir madde çıktı biz bunu fark etmeden yiyebiliriz de lütfen yetkililerden bir açıklama bekliyorum bu gıda maddesinin içinde ne gibi bir madde olabilir. Bize nasıl ortamlarda ürettiğiniz ürünleri yediriyorsunuz kesinlikle küf değil fotoğrafını da ekliyoru...Devamını oku
 Tokenized and Transformed to a vector of integers: 
tf.Tensor(
[[ 3451  3133  1770  1566  1605  1709  3133  6372   640     1  2025  1605
   1709    64   209  2335     1  4024   853   184  1037     1    72     2
   1709   623   177     1 18408   367    

In [None]:
vocab[:5]

['', '[UNK]', 'ne', 'tl', 'gun']

time: 4.75 ms (started: 2021-10-08 14:42:41 +00:00)
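
As the first entries show, index 0 is the empty padding token and index 1 is `[UNK]`. The decoding loop used above can be wrapped in a small helper. This is only a sketch: `decode_tokens` is a hypothetical name, and the toy vocabulary simply repeats the five entries printed above.

```python
def decode_tokens(token_ids, vocab, skip_padding=True):
    """Turn a sequence of token indices back into a space-separated string."""
    words = []
    for idx in token_ids:
        idx = int(idx)
        if skip_padding and idx == 0:
            continue  # index 0 is the padding entry ('')
        words.append(vocab[idx])
    return " ".join(words)


# The first five vocabulary entries shown in the output above.
first_entries = ["", "[UNK]", "ne", "tl", "gun"]
decoded_text = decode_tokens([2, 1, 4, 0, 0], first_entries)
print(decoded_text)  # ne [UNK] gun
```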


## 8.4. Save & Upload TextVectorization layer

Adapting the Keras `TextVectorization` layer on a large text dataset takes a considerable amount of time (more than 2 minutes in the run above), and you will often need to port the adapted layer to a different deployment environment. It is therefore good to know how to save and load it.

How to save a Keras `TextVectorization` layer? 

[There are currently 2 ways of doing it](https://stackoverflow.com/questions/65103526/how-to-save-textvectorization-to-disk-in-tensorflow):
* save the Keras `TextVectorization` layer in a Keras Model
* save the Keras `TextVectorization` layer as a pickle file.

In this tutorial, I will use the first approach as it is native to the TF/Keras environment.
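
For comparison, the second approach could look roughly like the sketch below. It is one possible variant, not the method used in this tutorial: instead of pickling the layer object, it pickles the learned vocabulary together with the one setting needed to rebuild the layer. The toy sentences, the file name `vectorizer_state.pkl`, and the dictionary keys are assumptions for illustration.

```python
import os
import pickle
import tempfile

import tensorflow as tf

# Adapt a small layer on made-up sentences (assumption: toy data only).
layer = tf.keras.layers.TextVectorization(
    output_mode="int", output_sequence_length=5)
layer.adapt(tf.constant(["urun gec geldi", "urun bozuk geldi"]))

# Drop the two special entries ('' and '[UNK]'); the layer re-adds them itself.
state = {
    "output_sequence_length": 5,
    "vocabulary": layer.get_vocabulary()[2:],
}

path = os.path.join(tempfile.mkdtemp(), "vectorizer_state.pkl")
with open(path, "wb") as f:
    pickle.dump(state, f)

# Later, possibly in another environment: rebuild the layer from the pickle.
with open(path, "rb") as f:
    restored_state = pickle.load(f)

restored_layer = tf.keras.layers.TextVectorization(
    output_mode="int",
    output_sequence_length=restored_state["output_sequence_length"])
restored_layer.set_vocabulary(restored_state["vocabulary"])

sample = tf.constant(["urun kargo geldi"])
original_ids = layer(sample).numpy()
restored_ids = restored_layer(sample).numpy()
print(original_ids)
print(restored_ids)
```

Both layers should produce the same token indices. Note that any non-default constructor arguments (for example a custom `standardize` function) would have to be stored and restored in the same way.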



### 8.4.1. Ensure that you are on the correct directory path :)

In [None]:
%cd ../models/
%ls

/content/gdrive/My Drive/Colab Notebooks/models
checkpoint/                    MultiClassTextClassificationExported/
end_to_end_model/              MultitopicTextGenerator/
MultiClassTextClassification/  vectorize_layer_model/
time: 366 ms (started: 2021-10-08 14:42:41 +00:00)


### 8.4.2. Create a temporary Keras `model` by adding the adapted Keras `TextVectorization` layer

In [None]:
# Create model.
vectorize_layer_model = tf.keras.models.Sequential()
vectorize_layer_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
vectorize_layer_model.add(vectorize_layer)
vectorize_layer_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 50)               0         
 torization)                                                     
                                                                 
Total params: 0
Trainable params: 0
Non-trainable params: 0
_________________________________________________________________
time: 256 ms (started: 2021-10-08 14:42:41 +00:00)


### 8.4.3. Save the temporary model including the adapted Keras `TextVectorization` layer

In [None]:
filepath = "vectorize_layer_model"

time: 721 µs (started: 2021-10-08 14:42:42 +00:00)


In [None]:
vectorize_layer_model.save(filepath, save_format="tf")

INFO:tensorflow:Assets written to: vectorize_layer_model/assets
time: 4.86 s (started: 2021-10-08 14:42:42 +00:00)


In [None]:
%ls 

checkpoint/                    MultiClassTextClassificationExported/
end_to_end_model/              MultitopicTextGenerator/
MultiClassTextClassification/  vectorize_layer_model/
time: 153 ms (started: 2021-10-08 14:42:46 +00:00)


### 8.4.4. Load the `vectorize_layer_model` back to check if saving was successful

In [None]:
loaded_vectorize_layer_model = tf.keras.models.load_model(filepath)

time: 1.93 s (started: 2021-10-08 14:42:47 +00:00)


### 8.4.5. Retrieve the **loaded** Keras `TextVectorization` layer

Here, you have 2 options:
* call the `loaded_vectorize_layer_model.predict()` method to use the Keras `TextVectorization` layer through the model, or
* get the Keras `TextVectorization` layer out of the `loaded_vectorize_layer_model` as below:



In [None]:
loaded_vectorize_layer = loaded_vectorize_layer_model.layers[0]

time: 1.97 ms (started: 2021-10-08 14:42:49 +00:00)


### 8.4.6. Compare the original and loaded `TextVectorization` layers

In [None]:
loaded_vocab=loaded_vectorize_layer.get_vocabulary()
print("original vocab has the ", len(vocab)," entries")
print("loaded vocab has the   ", len(loaded_vocab)," entries")
print("loaded vocab has the following first 10 entries")
for word in range(10):
  print(word, " represents the word: ")
  print(vocab[word], " in original vocab")
  print(loaded_vocab[word], " in loaded vocab")
for X in train_features[:1]:
  print(" Given raw data: " )
  print(X)

  tokenized = vectorize_layer(tf.expand_dims(X, -1))
  print(" Tokenized and Transformed to a vector of integers by the original vectorize layer:" )
  print (tokenized)

  tokenized = loaded_vectorize_layer(tf.expand_dims(X, -1))
  print(" Tokenized and Transformed to a vector of integers by the loaded vectorize layer:" )
  print (tokenized)
  
  tokenized = loaded_vectorize_layer_model.predict(tf.expand_dims(X, -1))
  print(" Tokenized and Transformed to a vector of integers by the loaded_vectorize_layer_model:" )
  print (tokenized)

  print(" Text after Tokenized and Transformed by the original vectorize layer:: ")
  transformed = ""
  for each in tf.squeeze(tokenized):
    transformed= transformed+ " "+ vocab[each]
  print(transformed)

  print(" Text after Tokenized and Transformed by the loaded vectorize layer:")
  transformed = ""
  for each in tf.squeeze(tokenized):
    transformed= transformed+ " "+ loaded_vocab[each]
  print(transformed)

original vocab has the  20002  entries
loaded vocab has the    20002  entries
loaded vocab has the following first 10 entries
0  represents the word: 
  in original vocab
  in loaded vocab
1  represents the word: 
[UNK]  in original vocab
[UNK]  in loaded vocab
2  represents the word: 
ne  in original vocab
ne  in loaded vocab
3  represents the word: 
tl  in original vocab
tl  in loaded vocab
4  represents the word: 
gun  in original vocab
gun  in loaded vocab
5  represents the word: 
urun  in original vocab
urun  in loaded vocab
6  represents the word: 
aldim  in original vocab
aldim  in loaded vocab
7  represents the word: 
siparis  in original vocab
siparis  in loaded vocab
8  represents the word: 
musteri  in original vocab
musteri  in loaded vocab
9  represents the word: 
tarihinde  in original vocab
tarihinde  in loaded vocab
 Given raw data: 
İçim Kaşar Peynir İçinden Yeşil Madde,Kaşar peynirin içinden maydanoza benzer yeşil bir madde çıktı biz bunu fark etmeden yiyebiliriz de l

As you can see above, we successfully saved and loaded the *adapted* Keras `TextVectorization` layer!

We can continue to the TensorFlow data pipeline with the **adapted** Keras `TextVectorization` layer:

In [None]:
pwd

'/content/gdrive/My Drive/Colab Notebooks/models'

time: 11.9 ms (started: 2021-10-08 14:42:49 +00:00)


# 9. APPLY KERAS `TEXTVECTORIZATION` TO TRAIN & TEST DATA SETS 

We can define a function to apply the Keras `TextVectorization` on a given string as follows:

In [None]:
def convert_text_input(sample):
    text = sample
    text = tf.expand_dims(text, -1)  
    #return tf.squeeze(vectorize_layer(text))
    return tf.squeeze(loaded_vectorize_layer(text)) 

time: 1.48 ms (started: 2021-10-08 14:42:49 +00:00)


We use the `map()` function of the TensorFlow `tf.data` API (TF Data Pipeline) to apply `convert_text_input()` to every sample in the `text` column (reviews) of the training and test datasets.

In [None]:
# Train X
train_text_ds = train_text_ds_raw.map(convert_text_input, 
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
# Test X
test_text_ds = test_text_ds_raw.map(convert_text_input, 
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE)

time: 696 ms (started: 2021-10-08 14:42:49 +00:00)
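
The same mapping idea can be tried on a tiny, self-contained example. This is a sketch under assumptions: the vocabulary and the two sentences are made up, and `vectorize_one` is a hypothetical stand-in for `convert_text_input()`. It also shows an alternative that batches first and vectorizes a whole batch at once, which avoids the `expand_dims`/`squeeze` pair.

```python
import tensorflow as tf

# Toy layer with a fixed, made-up vocabulary (assumption for illustration).
toy_layer = tf.keras.layers.TextVectorization(
    output_mode="int", output_sequence_length=4)
toy_layer.set_vocabulary(["urun", "geldi", "gec"])  # '' and '[UNK]' are added

raw_ds = tf.data.Dataset.from_tensor_slices(["urun gec geldi", "kargo geldi"])


def vectorize_one(text):
    # Scalar string -> shape (1,) -> layer output (1, 4) -> shape (4,)
    text = tf.expand_dims(text, -1)
    return tf.squeeze(toy_layer(text))


# Option 1: vectorize sample by sample, as in the cell above.
per_sample = raw_ds.map(vectorize_one, num_parallel_calls=tf.data.AUTOTUNE)

# Option 2: batch first, then vectorize each batch of strings at once.
batched = raw_ds.batch(2).map(lambda t: toy_layer(t),
                              num_parallel_calls=tf.data.AUTOTUNE)

per_sample_out = [v.tolist() for v in per_sample.as_numpy_iterator()]
batched_out = next(iter(batched.as_numpy_iterator())).tolist()
print(per_sample_out)  # [[2, 4, 3, 0], [1, 3, 0, 0]]
print(batched_out)     # [[2, 4, 3, 0], [1, 3, 0, 0]]
```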


Let's see the converted/encoded texts (reviews)

In [None]:
for each in train_text_ds.take(3):
  print(each)

tf.Tensor(
[ 3451  3133  1770  1566  1605  1709  3133  6372   640     1  2025  1605
  1709    64   209  2335     1  4024   853   184  1037     1    72     2
  1709   623   177     1 18408   367     1   282  2582  3586     1     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0], shape=(50,), dtype=int64)
tf.Tensor(
[  226    44   354  1078    17   226   215   206  9049  1079  2556   460
    11   574    11    19   294 13253    44  2481  1384   124   648   141
   206    44   672     1  2262    22  2862   890  5564  2058    67    44
   469  2481     1  4955  1862 15099     0     0     0     0     0     0
     0     0], shape=(50,), dtype=int64)
tf.Tensor(
[  465   171  3144   673   378     1   192  1280    10  1273   414  1023
   695    74   673  3805   102    25  1777     1  1706     1  6537  2406
   673     1  8569  9825  9478    79  1001   788   975   414     1   348
     1    28   348 13025    10 11558     1     0     0     0     0     0
     0   

# 10. GENERATE THE TRAIN SET BY COMBINING X & Y
* **X**: the preprocessed & encoded reviews
* **y**: the encoded categories

In [None]:
train_ds = tf.data.Dataset.zip(
    (
            train_text_ds,
            train_cat_ds_raw
     )
) 

time: 3.9 ms (started: 2021-10-08 14:42:50 +00:00)


Similarly, let's bundle test data sets as a single data set:

In [None]:
test_ds = tf.data.Dataset.zip(
    (
            test_text_ds,
            test_cat_ds_raw
     )
) 

time: 1.7 ms (started: 2021-10-08 14:42:50 +00:00)
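
What `tf.data.Dataset.zip()` does can be seen on a tiny example. The values below are made up for illustration (assumption): two encoded "reviews" and two category ids.

```python
import tensorflow as tf

# Made-up encoded reviews (X) and category ids (y).
toy_features = tf.data.Dataset.from_tensor_slices([[2, 4, 3, 0], [1, 3, 0, 0]])
toy_labels = tf.data.Dataset.from_tensor_slices([11, 5])

# zip() pairs the i-th element of each dataset into one (X, y) tuple.
toy_pairs = tf.data.Dataset.zip((toy_features, toy_labels))

pairs_out = [(x.tolist(), int(y)) for x, y in toy_pairs.as_numpy_iterator()]
print(pairs_out)  # [([2, 4, 3, 0], 11), ([1, 3, 0, 0], 5)]
```

The pairing is purely positional, so both datasets must be in the same order and must not be shuffled independently before zipping.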


We can see the result of the **Text Vectorization** in the **Data Pipelining** as follows:


In [None]:
for X,y in train_ds.take(1):
  print("input (review) X.shape: ", X.shape)
  print("output (category) y.shape: ", y.shape)
  print("input (review) X: ", X)
  print("output (category) y: ",y)
  review_text = " ".join([vocab[_] for _ in np.squeeze(X)])  # avoid shadowing the built-in input()
  category_text = id_to_category[y.numpy()]
  print("X: input (review) in text: " , review_text)
  print("y: output (category) in text: " , category_text)

input (review) X.shape:  (50,)
output (category) y.shape:  ()
input (review) X:  tf.Tensor(
[ 3451  3133  1770  1566  1605  1709  3133  6372   640     1  2025  1605
  1709    64   209  2335     1  4024   853   184  1037     1    72     2
  1709   623   177     1 18408   367     1   282  2582  3586     1     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0], shape=(50,), dtype=int64)
output (category) y:  tf.Tensor(11, shape=(), dtype=int64)
X: input (review) in text:  i̇cim kasar peynir i̇cinden yesil madde kasar peynirin icinden [UNK] benzer yesil madde cikti fark etmeden [UNK] yetkililerden aciklama bekliyorum gida [UNK] icinde ne madde olabilir bize [UNK] urettiginiz urunleri [UNK] kesinlikle kuf fotografini [UNK]               
y: output (category) in text:  gida
time: 167 ms (started: 2021-10-08 14:42:50 +00:00)


# 11. FINALIZE TENSORFLOW DATA PIPELINE
Finalize the TensorFlow Data Pipeline by setting the necessary parameters for batching, shuffling, and optimizing as follows:



In [None]:
batch_size = 64
AUTOTUNE = tf.data.experimental.AUTOTUNE
buffer_size= train_ds.cardinality().numpy()

train_ds = train_ds.shuffle(buffer_size=buffer_size)\
                   .batch(batch_size=batch_size,drop_remainder=True)\
                   .cache()\
                   .prefetch(AUTOTUNE)

test_ds = test_ds.shuffle(buffer_size=buffer_size)\
                   .batch(batch_size=batch_size,drop_remainder=True)\
                   .cache()\
                   .prefetch(AUTOTUNE)

time: 17.7 ms (started: 2021-10-08 14:42:50 +00:00)


In [None]:
train_ds.element_spec

(TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
 TensorSpec(shape=(64,), dtype=tf.int64, name=None))

time: 5.28 ms (started: 2021-10-08 14:42:50 +00:00)
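
One point worth knowing about the order of these calls: `cache()` replays exactly the elements it stored during the first pass, so when `shuffle()` comes before `cache()`, every later epoch reuses the order of the first epoch. If a fresh order per epoch is wanted, a common arrangement is to cache first and shuffle afterwards. The sketch below shows that arrangement on a toy `range` dataset (an assumption for illustration, not the tutorial's data).

```python
import tensorflow as tf

toy_ds = tf.data.Dataset.range(10)

# cache -> shuffle -> batch -> prefetch: elements are cached once,
# and the shuffle is redone on every pass over the dataset.
pipeline = (
    toy_ds.cache()
    .shuffle(buffer_size=10, reshuffle_each_iteration=True)
    .batch(4, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)

# Iterate twice to simulate two epochs.
epochs = [[batch.numpy().tolist() for batch in pipeline] for _ in range(2)]
for epoch_batches in epochs:
    print(epoch_batches)
```

With 10 elements, a batch size of 4, and `drop_remainder=True`, each epoch yields two full batches and drops the remaining two elements. For evaluation data, shuffling is usually not needed at all.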


# **PART D: BUILD AN END-TO-END MODEL**

# 12. Create a Classification Model

For the sake of demonstration of the Keras `TextVectorization` layer, let's build a very simple model:

In [None]:
def create_model():
    inputs_tokens = layers.Input(shape=(max_len,), dtype=tf.int32)
    embedding_layer = layers.Embedding(vocab_size, 256)
    x = embedding_layer(inputs_tokens)
    x = layers.Flatten()(x)
    outputs = layers.Dense(32)(x)
    model = keras.Model(inputs=inputs_tokens, outputs=outputs)
    
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metric_fn  = tf.keras.metrics.SparseCategoricalAccuracy()
    model.compile(optimizer="adam", loss=loss_fn, metrics=metric_fn)  
    
    return model
my_model=create_model()
my_model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 50)]              0         
                                                                 
 embedding (Embedding)       (None, 50, 256)           5120000   
                                                                 
 flatten (Flatten)           (None, 12800)             0         
                                                                 
 dense (Dense)               (None, 32)                409632    
                                                                 
Total params: 5,529,632
Trainable params: 5,529,632
Non-trainable params: 0
_________________________________________________________________
time: 54.3 ms (started: 2021-10-08 14:42:51 +00:00)
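
Note that the final `Dense(32)` layer has no activation, so the model outputs logits, which is why the loss is created with `from_logits=True`. The small sketch below, using made-up logits for a 3-class toy case (assumption), illustrates that this is equivalent to applying softmax first and using `from_logits=False`.

```python
import tensorflow as tf

# Made-up logits for 2 samples and 3 classes, with their true class ids.
toy_logits = tf.constant([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
toy_labels = tf.constant([0, 2])

# Softmax turns logits into probabilities that sum to 1 per sample.
toy_probs = tf.nn.softmax(toy_logits, axis=-1)

loss_on_logits = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True)(toy_labels, toy_logits)
loss_on_probs = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False)(toy_labels, toy_probs)

print(float(loss_on_logits), float(loss_on_probs))  # nearly identical values

# The predicted class is the same either way, since softmax preserves order.
pred_from_logits = tf.argmax(toy_logits, axis=-1).numpy().tolist()
pred_from_probs = tf.argmax(toy_probs, axis=-1).numpy().tolist()
print(pred_from_logits, pred_from_probs)  # [0, 2] [0, 2]
```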


# 13. Train the Classification Model

In [None]:
my_model.fit(train_ds, verbose=1, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fef7fe05050>

time: 6min 30s (started: 2021-10-08 14:42:51 +00:00)


In [None]:
loss, accuracy = my_model.evaluate(test_ds)
print("Test accuracy: ", accuracy)

Test accuracy:  0.9511414170265198
time: 55.6 s (started: 2021-10-08 14:49:21 +00:00)


# 14. An End-To-End Classification Model

Note that the above model expects batches of integer tensors as input:

```
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, 50)]              0         
```
Thus, you can NOT supply raw data (some text) to the model for prediction. TensorFlow/Keras would raise an error like the one below:



```python
raw_data=['Dün aldığım samsung telefon bugün şarj tutmuyor',
          'THY Uçak biletimi değiştirmek için başvurdum.  Kimse geri dönüş yapmadı!']

predictions=my_model.predict(raw_data)

ValueError: in user code: Exception encountered when calling layer "model" (type Functional).
    
    Input 0 of layer "dense" is incompatible with the layer: expected axis -1of input shape to have value 12800, but received input with shape (None, 256)
    
    Call arguments received:
      • inputs=tf.Tensor(shape=(None,), dtype=string)
      • training=False
      • mask=None

```



However, it is sometimes a big advantage to design a model that accepts raw data as input and preprocesses the data by itself.

For example, such a model can easily be exported to different platforms/environments without also exporting the preprocessing code!

Therefore, Keras provides [several Preprocessing Layers](https://keras.io/api/layers/preprocessing_layers/) so that we can integrate preprocessing logic as a layer into a Keras model.

Afterwards, we can export such models and use them on other platforms without rewriting the preprocessing code on the target platforms/environments.

Models of this kind can be called **End-To-End Models**. That is, an **End-To-End model** can accept Raw Input Data and preprocess it by itself.
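
As a rough illustration of the idea, the sketch below wires a `TextVectorization` layer in front of a tiny, *untrained* classifier so that the model can be called directly on raw strings. Everything here is a toy assumption (the vocabulary, the layer sizes, the 3 output classes); it is not the model trained in this tutorial.

```python
import tensorflow as tf

# Toy vectorizer with a fixed, made-up vocabulary (assumption).
toy_vectorizer = tf.keras.layers.TextVectorization(
    output_mode="int", output_sequence_length=4)
toy_vectorizer.set_vocabulary(["urun", "geldi", "gec"])  # 5 tokens with '' and '[UNK]'

# Raw string input -> token ids -> embeddings -> class probabilities.
text_in = tf.keras.Input(shape=(1,), dtype="string")
x = toy_vectorizer(text_in)
x = tf.keras.layers.Embedding(input_dim=5, output_dim=8)(x)
x = tf.keras.layers.Flatten()(x)
class_probs = tf.keras.layers.Dense(3, activation="softmax")(x)
toy_end_to_end = tf.keras.Model(text_in, class_probs)

# The model accepts raw text; no separate preprocessing step is needed.
raw_batch = tf.constant([["urun gec geldi"], ["kargo geldi"]])
toy_output = toy_end_to_end(raw_batch).numpy()
print(toy_output.shape)  # (2, 3)
```

Since the weights are random, the probabilities are meaningless here; the point is only that preprocessing travels with the model.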

**What could be Raw Data?**

It could be:
* text
* image
* structured data
* etc.

Let's create an **End-To-End Classification Model** by integrating the **adapted** Keras `TextVectorization` layer into the **trained model** as **the first layer**. 

You can create an **End-To-End Model** either by:
* Keras Sequential API, or
* Keras Functional API 

## 14.1. Create an End-To-End Model with Keras Sequential API

In [None]:
end_to_end_model = tf.keras.Sequential([
  keras.Input(shape=(1,), dtype="string"),
  vectorize_layer,
  my_model,
  layers.Activation('softmax')
])

end_to_end_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)
end_to_end_model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 50)               0         
 torization)                                                     
                                                                 
 model (Functional)          (None, 32)                5529632   
                                                                 
 activation (Activation)     (None, 32)                0         
                                                                 
Total params: 5,529,632
Trainable params: 5,529,632
Non-trainable params: 0
_________________________________________________________________
time: 282 ms (started: 2021-10-08 14:50:16 +00:00)


## 14.2. Create an End-To-End Model with Keras Functional API

In [None]:
inputs = keras.Input(shape=(1,), dtype="string")
x = vectorize_layer(inputs)
outputs = my_model(x)
end_to_end_model = keras.Model(inputs, outputs)
end_to_end_model.compile(
    # my_model ends with a Dense layer without activation, so its outputs are logits
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), optimizer="adam", metrics=['accuracy']
)
end_to_end_model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 50)               0         
 torization)                                                     
                                                                 
 model (Functional)          (None, 32)                5529632   
                                                                 
Total params: 5,529,632
Trainable params: 5,529,632
Non-trainable params: 0
_________________________________________________________________
time: 284 ms (started: 2021-10-08 14:50:17 +00:00)


## 14.3. Test the End-to-End model with Raw (Text) Data

In [None]:
raw_data=['Dün aldığım samsung telefon bugün şarj tutmuyor',
          'THY Uçak biletimi değiştirmek için başvurdum.  Kimse geri dönüş yapmadı!']
predictions=end_to_end_model.predict(raw_data)
print(id_to_category[np.argmax(predictions[0])])
print(id_to_category[np.argmax(predictions[1])])

alisveris
ulasim
time: 608 ms (started: 2021-10-08 14:50:17 +00:00)


In [None]:
loss, accuracy = end_to_end_model.evaluate(test_features,test_targets)
print("end_to_end_model accuracy: ", accuracy)

end_to_end_model accuracy:  0.9511488080024719
time: 46.8 s (started: 2021-10-08 14:50:18 +00:00)


## 14.4. Save the End-to-End model

In [None]:
end_to_end_model.save("end_to_end_model")

INFO:tensorflow:Assets written to: end_to_end_model/assets
time: 5.58 s (started: 2021-10-08 14:51:04 +00:00)


## 14.5. Load the End-to-End model

In [None]:
#changing the working directory
%cd "/content/gdrive/MyDrive/Colab Notebooks/models"

/content/gdrive/MyDrive/Colab Notebooks/models


In [None]:
loaded_end_to_end_model = tf.keras.models.load_model("end_to_end_model")

## 14.6. Test the Loaded End-to-End model with Raw (Text) Data

In [None]:
raw_data=['Dün aldığım samsung telefon bugün şarj tutmuyor',
          'THY Uçak biletimi değiştirmek için başvurdum.  Kimse geri dönüş yapmadı!']
predictions=loaded_end_to_end_model.predict(raw_data)
print(id_to_category[np.argmax(predictions[0])])
print(id_to_category[np.argmax(predictions[1])])

alisveris
ulasim
time: 81.8 ms (started: 2021-12-03 14:12:35 +00:00)


In [None]:
loss, accuracy = loaded_end_to_end_model.evaluate(test_features,test_targets)
print("loaded_end_to_end_model accuracy: ", accuracy)

loaded_end_to_end_model accuracy:  0.9511488080024719
time: 46.1 s (started: 2021-10-08 14:51:13 +00:00)


# **PART E: SUMMARY**
In this tutorial, we have learned:
* What a Keras `TextVectorization` layer is
* Why we need to use a Keras `TextVectorization` layer in Natural Language Processing (NLP) tasks
* How to employ a Keras `TextVectorization` layer in Text Preprocessing
* How to integrate a Keras `TextVectorization` layer into a trained model
* How to save and upload a Keras `TextVectorization` layer and a model with a Keras `TextVectorization` layer
* How to integrate a Keras `TextVectorization` layer with TensorFlow Data Pipeline API (`tf.data`)
* How to design, train, save, and load an End-to-End model using Keras `TextVectorization` layer

All of the above topics are presented in a **multi-class text classification** context.

If you like this tutorial, please follow the Murat Karakaya Akademi [YouTube channel](https://www.youtube.com/c/MuratKarakayaAkademi) and [Medium blog](https://kmkarakaya.medium.com/).

**Thank you for your patience!**

#Keep Deep Learning :)