##### Copyright 2020 The TensorFlow Hub Authors.


In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/text/classify_text_with_bert"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/classify_text_with_bert.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/classify_text_with_bert.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/text/classify_text_with_bert.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
  <td>
    <a href="https://tfhub.dev/google/collections/bert/1"><img src="https://www.tensorflow.org/images/hub_logo_32px.png" />See TF Hub model</a>
  </td>
</table>

## About BERT

[BERT](https://arxiv.org/abs/1810.04805) and other Transformer encoder architectures have been wildly successful on a variety of tasks in NLP (natural language processing). They compute vector-space representations of natural language that are suitable for use in deep learning models. The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after, hence the name: Bidirectional Encoder Representations from Transformers. 

BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks.


## Setup


In [None]:
# A dependency of the preprocessing for BERT inputs
!pip install -q tensorflow-text


You will use the AdamW optimizer from [tensorflow/models](https://github.com/tensorflow/models).

In [None]:
!pip install -q tf-models-official



In [None]:
import os
import shutil

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization  # to create the AdamW optimizer

import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')

In [None]:
import warnings
warnings.filterwarnings("ignore")

# 1. Store data

In [None]:
import pandas as pd
# The dataset originally comes from Kaggle; it is loaded here from a copy hosted on GitHub
df = pd.read_csv("https://raw.githubusercontent.com/Erfaniaa/fake-job-posting-prediction/master/dataset.csv")
df

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17875,17876,Account Director - Distribution,"CA, ON, Toronto",Sales,,Vend is looking for some awesome new talent to...,Just in case this is the first time you’ve vis...,To ace this role you:Will eat comprehensive St...,What can you expect from us?We have an open cu...,0,1,1,Full-time,Mid-Senior level,,Computer Software,Sales,0
17876,17877,Payroll Accountant,"US, PA, Philadelphia",Accounting,,WebLinc is the e-commerce platform and service...,The Payroll Accountant will focus primarily on...,- B.A. or B.S. in Accounting- Desire to have f...,Health &amp; WellnessMedical planPrescription ...,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Internet,Accounting/Auditing,0
17877,17878,Project Cost Control Staff Engineer - Cost Con...,"US, TX, Houston",,,We Provide Full Time Permanent Positions for m...,Experienced Project Cost Control Staff Enginee...,At least 12 years professional experience.Abil...,,0,0,0,Full-time,,,,,0
17878,17879,Graphic Designer,"NG, LA, Lagos",,,,Nemsia Studios is looking for an experienced v...,1. Must be fluent in the latest versions of Co...,Competitive salary (compensation will be based...,0,0,1,Contract,Not Applicable,Professional,Graphic Design,Design,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15185 non-null  object
 8   benefits             10670 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

In [None]:
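# Concatenate title and description into one text field.
# fillna("") keeps the title for the row whose description is missing;
# without it, str + NaN yields NaN and astype(str) turns that into the literal text "nan".
df["title_descib"] = df.title + " " + df.description.fillna("")
df.title_descib = df.title_descib.astype(str)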

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15185 non-null  object
 8   benefits             10670 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

# 2. Split data into train, val, test


2.1 Split df into train and test at 80:20.
2.2 Split train again into train and val so that val is 20% of the full dataset.


Let x be the fraction of the 80% training portion that goes to val:

0.8x = 0.2

x = 0.2/0.8 = 0.25

So splitting train into train and val at 75:25 gives train 60%, val 20%, test 20%.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(df.title_descib,df.fraudulent, test_size=0.2, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

In [None]:
print("train ", X_train.shape[0], y_train.shape[0])
print("val ", X_val.shape[0], y_val.shape[0])
print("test ", X_test.shape[0], y_test.shape[0])

train  10728 10728
val  3576 3576
test  3576 3576
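
The sizes printed above follow directly from the two-stage split. As a minimal sketch (the helper `split_sizes` is hypothetical and uses plain rounding, which may differ from scikit-learn's rounding for other dataset sizes), the arithmetic looks like this:

```python
def split_sizes(n_rows, test_fraction=0.2, val_fraction_of_rest=0.25):
    """Return (train, val, test) sizes for a two-stage split.

    Hypothetical helper for illustration only; it is not part of scikit-learn.
    """
    n_test = round(n_rows * test_fraction)         # first split: 80:20
    n_rest = n_rows - n_test
    n_val = round(n_rest * val_fraction_of_rest)   # second split: 75:25
    n_train = n_rest - n_val
    return n_train, n_val, n_test


sizes = split_sizes(17880)
print(sizes)
```

For the 17880 rows in this dataset the result is 10728 training rows and 3576 rows each for validation and test, which is 60%, 20%, and 20% of the data.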


# 3. Create tf.data.Dataset

In [None]:
import tensorflow as tf

train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds= train_ds.batch(32)

val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
val_ds= val_ds.batch(32)


test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
test_ds= test_ds.batch(32)
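
To make the effect of `.batch(32)` concrete, here is a plain-Python sketch that mimics the grouping without TensorFlow. The generator `batch` is a hypothetical stand-in for illustration; it assumes the default behaviour in which the final, smaller batch is kept rather than dropped.

```python
def batch(items, batch_size):
    """Yield consecutive slices of at most batch_size items (last one may be smaller)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the 10728 training examples
examples = list(range(10728))
batches = list(batch(examples, 32))

print(len(batches))       # number of steps in one epoch
print(len(batches[0]))    # size of a full batch
print(len(batches[-1]))   # size of the final partial batch
```

With 10728 training examples this gives 336 batches, the last of which holds only 8 examples.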

In [None]:
bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8' 

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}

map_model_to_preprocess = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')

BERT model selected           : https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3
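
The Small BERT entries in the dictionary above all follow one naming pattern, so they could also be generated instead of typed out. The sketch below assumes that pattern holds for every layer and hidden-size combination listed above; `small_bert_handles` is a hypothetical helper, not a TensorFlow Hub API.

```python
# Assumption: hidden size determines the number of attention heads, as in the names above
HIDDEN_TO_HEADS = {128: 2, 256: 4, 512: 8, 768: 12}
LAYERS = (2, 4, 6, 8, 10, 12)


def small_bert_handles(version=1):
    """Build the name -> URL mapping for the Small BERT family."""
    handles = {}
    for num_layers in LAYERS:
        for hidden, heads in HIDDEN_TO_HEADS.items():
            name = f"small_bert/bert_en_uncased_L-{num_layers}_H-{hidden}_A-{heads}"
            handles[name] = f"https://tfhub.dev/tensorflow/{name}/{version}"
    return handles


small_berts = small_bert_handles()
print(len(small_berts))
print(small_berts["small_bert/bert_en_uncased_L-4_H-512_A-8"])
```

The explicit dictionary is easier to scan and to edit by hand; the generated version is shorter and avoids copy-paste slips in the URLs.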


In [None]:
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
bert_model = hub.KerasLayer(tfhub_handle_encoder)


## Loading models from TensorFlow Hub

Here you can choose which BERT model you will load from TensorFlow Hub and fine-tune. There are multiple BERT models available.

  - [BERT-Base](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3), [Uncased](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3) and [seven more models](https://tfhub.dev/google/collections/bert/1) with trained weights released by the original BERT authors.
  - [Small BERTs](https://tfhub.dev/google/collections/bert/1) have the same general architecture but fewer and/or smaller Transformer blocks, which lets you explore tradeoffs between speed, size and quality.
  - [ALBERT](https://tfhub.dev/google/collections/albert/1): four different sizes of "A Lite BERT" that reduces model size (but not computation time) by sharing parameters between layers.
  - [BERT Experts](https://tfhub.dev/google/collections/experts/bert/1): eight models that all have the BERT-base architecture but offer a choice between different pre-training domains, to align more closely with the target task.
  - [Electra](https://tfhub.dev/google/collections/electra/1) has the same architecture as BERT (in three different sizes), but gets pre-trained as a discriminator in a set-up that resembles a Generative Adversarial Network (GAN).
  - BERT with Talking-Heads Attention and Gated GELU [[base](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1), [large](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_large/1)] has two improvements to the core of the Transformer architecture.

The model documentation on TensorFlow Hub has more details and references to the
research literature. Follow the links above, or click on the [`tfhub.dev`](http://tfhub.dev) URL
printed after the next cell execution.

The suggestion is to start with a Small BERT (with fewer parameters) since they are faster to fine-tune. If you like a small model but with higher accuracy, ALBERT might be your next option. If you want even better accuracy, choose
one of the classic BERT sizes or their recent refinements like Electra, Talking Heads, or a BERT Expert.

Aside from the models available below, there are [multiple versions](https://tfhub.dev/google/collections/transformer_encoders_text/1) of the models that are larger and can yield even better accuracy, but they are too big to be fine-tuned on a single GPU. You will be able to do that on the [Solve GLUE tasks using BERT on a TPU colab](https://www.tensorflow.org/tutorials/text/solve_glue_tasks_using_bert_on_tpu).

You'll see in the code below that switching the tfhub.dev URL is enough to try any of these models, because all the differences between them are encapsulated in the SavedModels from TF Hub.

The BERT models return a map with 3 important keys: `pooled_output`, `sequence_output`, `encoder_outputs`:

- `pooled_output` represents each input sequence as a whole. The shape is `[batch_size, H]`. You can think of this as an embedding for the entire job posting.
- `sequence_output` represents each input token in context. The shape is `[batch_size, seq_length, H]`. You can think of this as a contextual embedding for every token in the job posting.
- `encoder_outputs` are the intermediate activations of the `L` Transformer blocks. `outputs["encoder_outputs"][i]` is a Tensor of shape `[batch_size, seq_length, H]` with the outputs of the i-th Transformer block, for `0 <= i < L`. The last value of the list is equal to `sequence_output`.

For the fine-tuning you are going to use the `pooled_output` array.
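
The values of `L` and `H` are encoded in the model name, so the expected output shapes can be worked out before loading anything. The sketch below is illustrative only: `expected_output_shapes` is a hypothetical helper, and `seq_length=128` is an assumption about the preprocessing model's default sequence length.

```python
import re


def expected_output_shapes(model_name, batch_size=32, seq_length=128):
    """Derive the output shapes described above from a name like '..._L-4_H-512_A-8'."""
    match = re.search(r"L-(\d+)_H-(\d+)_A-(\d+)", model_name)
    if match is None:
        raise ValueError(f"cannot read L/H/A from model name: {model_name!r}")
    num_layers, hidden, _heads = (int(group) for group in match.groups())
    token_shape = (batch_size, seq_length, hidden)
    return {
        "pooled_output": (batch_size, hidden),
        "sequence_output": token_shape,
        "encoder_outputs": [token_shape] * num_layers,  # one entry per Transformer block
    }


shapes = expected_output_shapes("small_bert/bert_en_uncased_L-4_H-512_A-8")
for key, value in shapes.items():
    print(key, value)
```

For the model selected in this notebook, `pooled_output` would therefore be a `[32, 512]` tensor per batch, which is what the classifier head consumes.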

## Define your model

You will create a very simple fine-tuned model consisting of the preprocessing model, the selected BERT model, a Dropout layer, and one Dense layer.

Note: for more information about the base model's inputs and outputs, follow the model's URL to its documentation. Here you don't need to worry about the details, because the preprocessing model takes care of them for you.


In [None]:
def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')(text_input)
  Bert_layer = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')(preprocessing_layer)
  Bert_output = Bert_layer['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(Bert_output)
  output_layer = tf.keras.layers.Dense(1, activation="sigmoid", name='classifier')(net)
  return tf.keras.Model(text_input, output_layer)
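
Because the final Dense layer uses a sigmoid activation, the model outputs a probability between 0 and 1 rather than a raw logit. The following plain-Python sketch shows the sigmoid and the binary cross-entropy computed on such probabilities. It is a simplified illustration, not Keras's implementation; the clipping constant `eps` is an assumption chosen to avoid `log(0)`.

```python
import math


def sigmoid(z):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


print(sigmoid(0.0))
confident = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.6, 0.4])
print(confident, unsure)
```

Confident, correct predictions give a lower loss than hesitant ones, which is the signal the optimizer follows during fine-tuning.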
  

Let's look at the model's structure




In [None]:
classifier_model1 = build_classifier_model()
#tf.keras.utils.plot_model(classifier_model1)


## Model training

You now have all the pieces to train a model, including the preprocessing module, BERT encoder, data, and classifier.

In [None]:
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
metrics = tf.metrics.AUC()
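
AUC is a reasonable metric here because fraudulent postings are rare, so plain accuracy can look good even for a model that never flags anything. The sketch below computes AUC exactly as the fraction of (fraudulent, real) pairs that are ranked correctly. Note that this is an illustration under that definition; `tf.metrics.AUC` approximates the same quantity with a fixed set of thresholds, so its values can differ slightly.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive has the higher score."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


example_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# A model that scores every posting the same: 19 real postings, 1 fraudulent one
labels = [0] * 19 + [1]
constant_scores = [0.0] * 20
accuracy = sum(1 for y in labels if y == 0) / len(labels)  # always predicting "real"
constant_auc = pairwise_auc(labels, constant_scores)

print(example_auc, accuracy, constant_auc)
```

The constant model reaches 95% accuracy on this toy data while its AUC stays at 0.5, the value of a random ranking.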

# 4. Train and fine-tune BERT

In [None]:
epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)
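
With the split sizes reported earlier, these step counts can be checked by hand. The sketch below also draws a rough picture of the learning-rate schedule. The schedule shape (linear warmup to the initial rate, then linear decay towards zero over the full run) is an assumption about how `create_optimizer` behaves, and `sketch_learning_rate` is a hypothetical helper, so the exact curve may differ.

```python
import math

train_examples = 10728   # size of the training split reported earlier
batch_size = 32
epochs = 5

steps_per_epoch = math.ceil(train_examples / batch_size)
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1 * num_train_steps)


def sketch_learning_rate(step, init_lr=1e-5):
    """Assumed shape: linear warmup to init_lr, then linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    return init_lr * max(0.0, 1.0 - step / num_train_steps)


print(steps_per_epoch, num_train_steps, num_warmup_steps)
for step in (0, 84, 168, 840, 1680):
    print(step, sketch_learning_rate(step))
```

The first 10% of the steps ramp the learning rate up gently, which tends to protect the pre-trained weights from large early updates.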

In [None]:
learning_rate = 1e-5
optimizer = optimization.create_optimizer(init_lr=learning_rate,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
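# AUC should increase, so state the direction explicitly instead of relying on mode='auto'
callback = tf.keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=0)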
classifier_model1.compile(optimizer=optimizer,
                        loss=loss,
                        metrics=metrics)


history1 = classifier_model1.fit(x=train_ds,
                            validation_data=val_ds,
                            epochs=epochs,
                            callbacks = [callback])


Epoch 1/5
Epoch 2/5
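
The log above ends at epoch 2 of 5, which is consistent with early stopping ending the run. The sketch below is a simplified model of patience-based early stopping on a metric that should increase. It is not Keras's exact implementation, whose details vary between versions; `epochs_run` is a hypothetical helper and the scores are made-up illustration values.

```python
def epochs_run(val_scores, patience=0):
    """Return how many epochs complete before a simplified early stop triggers."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop after this epoch
    return len(val_scores)


print(epochs_run([0.90, 0.89, 0.95], patience=0))        # stops at the first drop
print(epochs_run([0.90, 0.92, 0.95], patience=0))        # never stops early
print(epochs_run([0.90, 0.89, 0.88, 0.95], patience=2))  # tolerates one bad epoch
```

A larger `patience` gives the model a chance to recover from a temporarily worse validation score, at the cost of longer training.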


In [None]:
# evaluate() returns the loss followed by the compiled metric, which is AUC here (not accuracy)
loss1, accuracy1 = classifier_model1.evaluate(test_ds)

print(f'Loss: {loss1}')
print(f'AUC: {accuracy1}')

Loss: 0.10475272685289383
AUC: 0.9153377413749695


In [None]:
classifier_model2 = build_classifier_model()
learning_rate = 2e-5
optimizer = optimization.create_optimizer(init_lr=learning_rate,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
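# AUC should increase, so state the direction explicitly instead of relying on mode='auto'
callback = tf.keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=0)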
classifier_model2.compile(optimizer=optimizer,
                        loss=loss,
                        metrics=metrics)


history2 = classifier_model2.fit(x=train_ds,
                            validation_data=val_ds,
                            epochs=epochs,
                            callbacks = [callback])

Epoch 1/5
Epoch 2/5


In [None]:
# Evaluate the second model (learning rate 2e-5); the second value is AUC, the compiled metric
loss2, accuracy2 = classifier_model2.evaluate(test_ds)

print(f'Loss: {loss2}')
print(f'AUC: {accuracy2}')


In [None]:
classifier_model3 = build_classifier_model()
learning_rate = 3e-5
optimizer = optimization.create_optimizer(init_lr=learning_rate,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
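# AUC should increase, so state the direction explicitly instead of relying on mode='auto'
callback = tf.keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=0)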
classifier_model3.compile(optimizer=optimizer,
                        loss=loss,
                        metrics=metrics)


history3 = classifier_model3.fit(x=train_ds,
                            validation_data=val_ds,
                            epochs=epochs,
                            callbacks = [callback])

Epoch 1/5
Epoch 2/5


In [None]:
# evaluate() returns the loss followed by the compiled metric, which is AUC here (not accuracy)
loss3, accuracy3 = classifier_model3.evaluate(test_ds)

print(f'Loss: {loss3}')
print(f'AUC: {accuracy3}')

Loss: 0.08090020716190338
AUC: 0.9435569047927856


In [None]:
classifier_model5 = build_classifier_model()
learning_rate = 5e-5
optimizer = optimization.create_optimizer(init_lr=learning_rate,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
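# AUC should increase, so state the direction explicitly instead of relying on mode='auto'
callback = tf.keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=0)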
classifier_model5.compile(optimizer=optimizer,
                        loss=loss,
                        metrics=metrics)


history5 = classifier_model5.fit(x=train_ds,
                            validation_data=val_ds,
                            epochs=epochs,
                            callbacks = [callback])

Epoch 1/5
Epoch 2/5


In [None]:
# evaluate() returns the loss followed by the compiled metric, which is AUC here (not accuracy)
loss5, accuracy5 = classifier_model5.evaluate(test_ds)

print(f'Loss: {loss5}')
print(f'AUC: {accuracy5}')

Loss: 0.07940886169672012
AUC: 0.9514487981796265


In [None]:
def convert_val(val,thred =0.5):
  if val >= thred:
    return 1
  return 0
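
As a quick usage sketch, `convert_val` turns the sigmoid probabilities into hard 0/1 labels. The function is repeated here so the snippet runs on its own, and the probabilities are made-up illustration values rather than model output.

```python
def convert_val(val, thred=0.5):
    """Return 1 if the probability reaches the threshold, else 0."""
    if val >= thred:
        return 1
    return 0


probs = [0.03, 0.5, 0.91, 0.49]
default_labels = [convert_val(p) for p in probs]
strict_labels = [convert_val(p, thred=0.9) for p in probs]

print(default_labels)
print(strict_labels)
```

Raising the threshold flags fewer postings as fraudulent, trading recall for precision.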

# 6. Undersampling process

Split the data again into train 60%, val 20%, test 20%, using the same random seeds as before.

In [None]:
X_train, X_test, y_train, y_test= train_test_split(df.title_descib,df.fraudulent, test_size=0.2, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

In [None]:
(y_train.value_counts().sum(),y_val.value_counts().sum(),y_test.value_counts().sum())

(10728, 3576, 3576)

In [None]:
df_train = pd.DataFrame(X_train).join(y_train)
df_train

Unnamed: 0,title_descib,fraudulent
16379,"Marketing Manager Gust, the world’s largest on...",0
4068,Security Engineer We are looking for highly sk...,0
8310,English Teacher Overseas (Conversational) Jobs...,0
1353,Commercial Director We are looking for a techn...,0
13730,Senior Territory Manager Recombine is advancin...,0
...,...,...
11282,"English Teacher Abroad Play with kids, get pa...",0
12951,Account Manager Company: Lamark MediaTitle: Ac...,0
4070,CUSTOMER SERVICE REPRESENTATIVE Community Heal...,1
5405,Developer (Integration) The Medopad team is se...,0


In [None]:
fake= df_train[df_train['fraudulent']==1]
real = df_train[df_train['fraudulent']==0]
real = real.sample(n=len(fake), random_state=101)
df_undersample = pd.concat([fake,real],axis=0)
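
The same idea can be shown without pandas. The sketch below keeps every minority-class row and draws an equally sized random sample from the majority class. `undersample` is a hypothetical helper operating on toy data, and unlike the cell above it also shuffles the result so the two classes are interleaved.

```python
import random


def undersample(rows, label_of, seed=101):
    """Keep all minority rows plus an equal-sized random sample of majority rows."""
    positives = [row for row in rows if label_of(row) == 1]
    negatives = [row for row in rows if label_of(row) == 0]
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)  # interleave the two classes
    return balanced


# Toy data: 1 in 10 postings is fraudulent
rows = [(f"posting {i}", 1 if i % 10 == 0 else 0) for i in range(100)]
balanced = undersample(rows, label_of=lambda row: row[1])

print(len(balanced), sum(label for _, label in balanced))
```

Undersampling discards most of the real postings, so the balanced training set is much smaller than the original one; the validation and test sets are left untouched and keep the natural class ratio.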

In [None]:
df_undersample

Unnamed: 0,title_descib,fraudulent
15264,Immediate Staff Needed For Cash Positions. Imm...,1
493,Admin Assistant/ Receptionist A Newly establi...,1
2041,Head of machining department Corporate overvie...,1
2892,Marketing Database Decision Strategy Consultan...,1
15149,URGENT Full & Part Time Workers Needed. URGENT...,1
...,...,...
14914,Temporary Benefits Analyst Fortune Brands need...,0
8525,"Branch Manager Westview Financial Services, lo...",0
13352,Line Cook & Dishwashers Branded Saloon of Broo...,0
16772,Urgent Requirement : Dotnet Developer for UAE ...,0


In [None]:
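balance_train_ds = tf.data.Dataset.from_tensor_slices((df_undersample.title_descib, df_undersample.fraudulent))
# Shuffle before batching: df_undersample lists all fraudulent rows first,
# so unshuffled batches would each contain a single class
balance_train_ds = balance_train_ds.shuffle(len(df_undersample), seed=42).batch(32)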

In [None]:
bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8' 

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}

map_model_to_preprocess = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')

BERT model selected           : https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3
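
Both dictionaries are indexed with the same `bert_model_name`, so every encoder key needs a matching preprocessing key; otherwise the second lookup raises a `KeyError`. A minimal sketch of that consistency check, using a three-entry excerpt of the dictionaries above (the names `encoders`, `preprocessors`, and `missing` are illustrative and not part of the notebook):

```python
# Excerpt of the two handle dictionaries defined above.
encoders = {
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
}
preprocessors = {
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

# Encoder names without a preprocessing handle would fail at lookup time.
missing = sorted(set(encoders) - set(preprocessors))
print(missing)
```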


In [None]:
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
bert_model = hub.KerasLayer(tfhub_handle_encoder)


In [None]:
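def build_classifier_model():
  # Raw strings go in; the TF Hub preprocessing layer tokenizes them for BERT.
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  # trainable=True fine-tunes the encoder weights together with the classifier head.
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  # pooled_output is one fixed-size embedding per input text.
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  # Sigmoid output: a probability for class 1, matching from_logits=False in the loss.
  net = tf.keras.layers.Dense(1, activation="sigmoid", name='classifier')(net)
  return tf.keras.Model(text_input, net)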

In [None]:
classifier_model1_2 = build_classifier_model()

In [None]:
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
# Name the metric explicitly so the logs contain 'val_auc' for EarlyStopping.
metrics = tf.metrics.AUC(name='auc')
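
Because the classifier ends in a sigmoid, the loss is created with `from_logits=False`, meaning it expects probabilities rather than raw scores. As a rough illustration of the quantity being averaged, here is binary cross-entropy written out in plain Python. The function name `binary_cross_entropy` and the sample values are made up for this sketch, and Keras additionally clips probabilities for numerical stability:

```python
import math

def binary_cross_entropy(y_true, y_prob):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_prob)
    ]
    return sum(losses) / len(losses)

# A confident correct prediction costs little; a confident wrong one costs a lot.
good = binary_cross_entropy([1, 0], [0.9, 0.1])
bad = binary_cross_entropy([1, 0], [0.1, 0.9])
print(good, bad)
```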

In [None]:
epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(balance_train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)
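
The first 10% of the training steps are reserved for warmup. As a simplified sketch of what such a schedule looks like, the code below assumes a linear ramp up to the peak learning rate followed by a linear decay to zero; the exact formula used by `optimization.create_optimizer` may differ. The helper `lr_at_step` and the `demo_*` values (100 batches per epoch) are hypothetical:

```python
def lr_at_step(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warmup to init_lr, then linear decay to zero (simplified)."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / (num_train_steps - num_warmup_steps)

# Hypothetical dataset size; the notebook reads the real value from
# tf.data.experimental.cardinality(balance_train_ds).
demo_epochs = 5
demo_steps_per_epoch = 100
demo_train_steps = demo_steps_per_epoch * demo_epochs
demo_warmup_steps = int(0.1 * demo_train_steps)

for step in (0, 25, 50, 275, 500):
    print(step, lr_at_step(step, 1e-5, demo_train_steps, demo_warmup_steps))
```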

In [None]:
learning_rate = 1e-5
optimizer = optimization.create_optimizer(init_lr=learning_rate,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
# AUC should be maximized; set mode explicitly instead of relying on 'auto'.
callback = tf.keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=0)
classifier_model1_2.compile(optimizer=optimizer,
                        loss=loss,
                        metrics=metrics)


history1_2 = classifier_model1_2.fit(x= balance_train_ds,
                            validation_data= val_ds,
                            epochs=epochs,
                            callbacks = [callback])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
loss1, accuracy1 = classifier_model1_2.evaluate(test_ds)

# The model was compiled with an AUC metric, so the second value is AUC, not accuracy.
print(f'Loss: {loss1}')
print(f'AUC: {accuracy1}')

Loss: 0.660708487033844
AUC: 0.8408603668212891


In [None]:
classifier_model2_2 = build_classifier_model()
learning_rate = 2e-5
optimizer = optimization.create_optimizer(init_lr=learning_rate,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
# AUC should be maximized; set mode explicitly instead of relying on 'auto'.
callback = tf.keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=0)
classifier_model2_2.compile(optimizer=optimizer,
                        loss=loss,
                        metrics=metrics)


history2_2 = classifier_model2_2.fit(x= balance_train_ds,
                            validation_data= val_ds,
                            epochs=epochs,
                            callbacks = [callback])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
loss2, accuracy2 = classifier_model2_2.evaluate(test_ds)

# The model was compiled with an AUC metric, so the second value is AUC, not accuracy.
print(f'Loss: {loss2}')
print(f'AUC: {accuracy2}')

Loss: 1.2355387210845947
AUC: 0.8063482642173767


In [None]:
classifier_model3_2 = build_classifier_model()
learning_rate = 3e-5
optimizer = optimization.create_optimizer(init_lr=learning_rate,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
# AUC should be maximized; set mode explicitly instead of relying on 'auto'.
callback = tf.keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=0)
classifier_model3_2.compile(optimizer=optimizer,
                        loss=loss,
                        metrics=metrics)


history3_2 = classifier_model3_2.fit(x=balance_train_ds,
                            validation_data= val_ds,
                            epochs=epochs,
                            callbacks = [callback])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
loss3, accuracy3 = classifier_model3_2.evaluate(test_ds)

# The model was compiled with an AUC metric, so the second value is AUC, not accuracy.
print(f'Loss: {loss3}')
print(f'AUC: {accuracy3}')

Loss: 0.7589532732963562
AUC: 0.8367089629173279


In [None]:
classifier_model5_2 = build_classifier_model()
learning_rate = 5e-5
optimizer = optimization.create_optimizer(init_lr=learning_rate,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
# AUC should be maximized; set mode explicitly instead of relying on 'auto'.
callback = tf.keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=0)
classifier_model5_2.compile(optimizer=optimizer,
                        loss=loss,
                        metrics=metrics)


history5_2 = classifier_model5_2.fit(x=balance_train_ds,
                            validation_data= val_ds,
                            epochs=epochs,
                            callbacks = [callback])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
loss5, accuracy5 = classifier_model5_2.evaluate(test_ds)

# The model was compiled with an AUC metric, so the second value is AUC, not accuracy.
print(f'Loss: {loss5}')
print(f'AUC: {accuracy5}')

Loss: 0.5317009091377258
AUC: 0.5317674875259399


In [None]:
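def get_data_pred(model, dataset=val_ds):
  """Return one class label per example in `dataset`."""
  raw_preds = model.predict(dataset)
  # convert_val (defined elsewhere in this notebook) turns each raw prediction into a label.
  return [convert_val(pred) for pred in raw_preds]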
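
`model.predict` returns one sigmoid probability per example, so something has to turn each probability into a 0/1 label before it can be passed to `classification_report`. `convert_val` is not shown in this section; the sketch below assumes it is a simple 0.5 threshold, and the name `to_label` is a hypothetical stand-in for it:

```python
def to_label(prob, threshold=0.5):
    """Map a predicted probability of class 1 to a hard 0/1 label."""
    return 1 if prob >= threshold else 0

# predict() on a Dense(1) head yields rows of shape (1,), e.g. [0.91].
raw_preds = [[0.91], [0.08], [0.50], [0.49]]
labels = [to_label(row[0]) for row in raw_preds]
print(labels)  # [1, 0, 1, 0]
```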

In [None]:
from sklearn.metrics import classification_report

In [None]:
def report_model_for_val(model, true_label):
  print(classification_report(true_label, get_data_pred(model), labels=[0, 1]))

**Note: I consistently compare the following three models**
- **Best model trained on unbalanced data**: *classifier_model5* (model5 for short)
- **Best model trained on balanced data**: *classifier_model3_2* (model3_2 for short)

To compare unbalanced and balanced training at the same learning rate, I also include the balanced-data model that uses the same learning rate as model5:
- *classifier_model5_2* (model5_2 for short)

# 5.+ 6. Analysis ************* Evaluation on validation data ****************** 

# Evaluate validation data: model trained on unbalanced data vs. models trained on balanced data

**- model trained on unbalanced data**

In [None]:
report_model_for_val(classifier_model5, y_val)

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      3391
           1       0.80      0.74      0.77       185

    accuracy                           0.98      3576
   macro avg       0.89      0.86      0.88      3576
weighted avg       0.98      0.98      0.98      3576



**- models trained on balanced data**

In [None]:
report_model_for_val(classifier_model3_2, y_val)

              precision    recall  f1-score   support

           0       0.99      0.50      0.67      3391
           1       0.09      0.92      0.17       185

    accuracy                           0.52      3576
   macro avg       0.54      0.71      0.42      3576
weighted avg       0.95      0.52      0.64      3576



In [None]:
report_model_for_val(classifier_model5_2, y_val)

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      3391
           1       0.00      0.00      0.00       185

    accuracy                           0.95      3576
   macro avg       0.47      0.50      0.49      3576
weighted avg       0.90      0.95      0.92      3576



**Does undersampling help reduce the F1 score of the fraudulent class?**

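The F1 score on class 1 (the fraudulent class) drops dramatically, from 0.77 for the unbalanced model to 0.17 for model3_2 and to 0.00 for model5_2. The F1 score on class 0 (the non-fraudulent class) also drops, but by less: from 0.99 to 0.67 for model3_2 and to 0.97 for model5_2.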



# **Confusion matrix on validation data**

*- best model trained on unbalanced data*

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_val, get_data_pred(classifier_model5))

array([[3357,   34],
       [  49,  136]])
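
scikit-learn lays out a binary confusion matrix as `[[TN, FP], [FN, TP]]`, with rows for the true class and columns for the predicted class. Taking class 1 (fraudulent) as the positive class, the class 1 precision, recall, and F1 from the classification report can be recomputed from the matrix above. The helper name `prf_from_matrix` is made up for this sketch:

```python
def prf_from_matrix(matrix):
    """Return (precision, recall, f1) for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Validation confusion matrix of classifier_model5 (from the cell above).
precision, recall, f1 = prf_from_matrix([[3357, 34], [49, 136]])
print(f'{precision:.2f} {recall:.2f} {f1:.2f}')  # 0.80 0.74 0.77
```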

*- the two models trained on balanced data (model3_2 and model5_2)*

In [None]:
confusion_matrix(y_val, get_data_pred(classifier_model3_2))

array([[1699, 1692],
       [  14,  171]])

In [None]:
confusion_matrix(y_val, get_data_pred(classifier_model5_2))

array([[3391,    0],
       [ 185,    0]])
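
The last matrix shows why accuracy alone is misleading on this data: a model that predicts class 0 for every example is still right about 95% of the time, because only 185 of the 3576 validation examples are fraudulent. A short sketch that computes accuracy and class 1 recall for the three matrices above (the dictionary keys and variable names are just labels for this example):

```python
# Validation confusion matrices from the cells above, as [[TN, FP], [FN, TP]].
matrices = {
    'model5': [[3357, 34], [49, 136]],
    'model3_2': [[1699, 1692], [14, 171]],
    'model5_2': [[3391, 0], [185, 0]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    fraud_recall = tp / (tp + fn)
    summary[name] = (round(accuracy, 2), round(fraud_recall, 2))
    print(f'{name}: accuracy={accuracy:.2f}, class 1 recall={fraud_recall:.2f}')
```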

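**Summary**: In the model names, modelN means a learning rate of Ne-5, and the suffix _2 means the model was trained on the balanced data. The matrices read as [[TN, FP], [FN, TP]], with class 1 (fraudulent) as the positive class. At the same learning rate, going from model5 to model5_2 moves every prediction to class 0: false positives fall from 34 to 0, but true positives also fall from 136 to 0 and false negatives rise from 49 to 185. model3_2 is still the best balanced model to me because it finds more fraud (true positives rise from 136 to 171), but it is not doing well overall: its true negatives are cut roughly in half (3357 to 1699) and its false positives jump from 34 to 1692.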

# AUC on validation data

*- best model trained on unbalanced data*

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_val, classifier_model5.predict(val_ds))

0.960679700638415

*- the two models trained on balanced data (the best one, and the one with the same learning rate as the unbalanced model)*

In [None]:
roc_auc_score(y_val, classifier_model3_2.predict(val_ds))

0.8454350546358803

In [None]:
roc_auc_score(y_val, classifier_model5_2.predict(val_ds))

0.5730351407142915

**How does undersampling affect the overall validation AUC?**

**Summary**: The validation AUC of the models trained on balanced data is worse than that of the unbalanced model (0.96). The best balanced model, model3_2, still reaches about 0.85, so it is not too bad, while model5_2 drops to 0.57.
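
Unlike the confusion matrix, ROC AUC does not depend on a decision threshold: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A small pure-Python sketch of that definition with invented scores; `pairwise_auc` is a hypothetical helper and not how `roc_auc_score` is implemented:

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores (not from the notebook's data).
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.8, 0.35, 0.9]
auc = pairwise_auc(y_true, scores)
print(auc)
```

This is why a model can keep a reasonable AUC and still produce a poor confusion matrix at a fixed threshold, as model3_2 does.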

--------------------------------------------------------------------------------

# ************* 7. Evaluation on testing data ******************

**My best model overall is model3_2**

In [None]:
def report_model_for_test(model, true_label):
  print(classification_report(true_label, get_data_pred(model, test_ds), labels=[0, 1]))

# Evaluate testing data: model trained on unbalanced data vs. models trained on balanced data

**- model trained on unbalanced data**

In [None]:
report_model_for_test(classifier_model5, y_test)

              precision    recall  f1-score   support

           0       0.98      0.99      0.99      3395
           1       0.78      0.71      0.74       181

    accuracy                           0.98      3576
   macro avg       0.88      0.85      0.87      3576
weighted avg       0.97      0.98      0.97      3576



**- models trained on balanced data**

In [None]:
report_model_for_test(classifier_model3_2, y_test)

              precision    recall  f1-score   support

           0       0.99      0.49      0.66      3395
           1       0.09      0.93      0.16       181

    accuracy                           0.51      3576
   macro avg       0.54      0.71      0.41      3576
weighted avg       0.95      0.51      0.63      3576



In [None]:
report_model_for_test(classifier_model5_2, y_test)

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      3395
           1       0.00      0.00      0.00       181

    accuracy                           0.95      3576
   macro avg       0.47      0.50      0.49      3576
weighted avg       0.90      0.95      0.92      3576



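**Summary**: On the test data, the class 1 F1 score of the models trained on balanced data is lower than that of the unbalanced model (0.74): 0.16 for model3_2 and 0.00 for model5_2. Accuracy falls from 0.98 to 0.51 for the best balanced model (model3_2). For model5_2, which uses the same learning rate (5e-5) as the unbalanced model, accuracy decreases only slightly, to 0.95, but only because it predicts class 0 for every example.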

# **Confusion matrix on testing data**

- best model trained on unbalanced data

In [None]:
confusion_matrix(y_test, get_data_pred(classifier_model5, test_ds))

array([[3358,   37],
       [  52,  129]])

- the best model trained on balanced data (model3_2) and the balanced model with the same learning rate as the best unbalanced model (model5_2)

In [None]:
confusion_matrix(y_test, get_data_pred(classifier_model3_2, test_ds))

array([[1672, 1723],
       [  12,  169]])

In [None]:
confusion_matrix(y_test, get_data_pred(classifier_model5_2, test_ds))

array([[3395,    0],
       [ 181,    0]])

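**Summary**: The matrices read as [[TN, FP], [FN, TP]], with class 1 (fraudulent) as the positive class. The balanced model at the same learning rate (5_2) predicts class 0 for every test example: compared to the unbalanced model, true negatives rise from 3358 to 3395 and false positives fall from 37 to 0, but true positives fall from 129 to 0 and false negatives rise from 52 to 181, which is not good. Therefore, the confusion matrix of the balanced model is not better than that of the unbalanced model.

The best balanced model (3_2) finds more fraud (true positives rise from 129 to 169), but it does worse overall: its true negatives are cut roughly in half (3358 to 1672) and its false positives rise from 37 to 1723.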

# AUC on test data

*- best model trained on unbalanced data*

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, classifier_model5.predict(test_ds))

0.9558824725994515

*- the two models trained on balanced data: the best one (model3_2) and the one with the same learning rate as the unbalanced model (model5_2)*

In [None]:
roc_auc_score(y_test, classifier_model3_2.predict(test_ds))

0.836184183760649

In [None]:
roc_auc_score(y_test, classifier_model5_2.predict(test_ds))

0.5466675888331068

**Summary** : AUC on testing data is doing worse on balancing