# 1. Introduction

Nikki Satmaka - Batch 11

## Description

The dataset is taken from [Kaggle](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset).

Context:

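This dataset contains news articles split into two files, `Fake.csv` for fake news and `True.csv` for real news. Each article comes with a `title`, `text`, `subject`, and `date`.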

### Objective

- pass

### Problem Statement

- pass

## Prepare Dataset

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# prepare kaggle environment
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/03-resources/kaggle/kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# download dataset
!mkdir -p data
!kaggle datasets download -p data --unzip clmentbisaillon/fake-and-real-news-dataset

In [None]:
# install dependencies
!pip install transformers
!pip install feature-engine --quiet --progress-bar off
!python -m spacy download en_core_web_sm --quiet --progress-bar off
!python -m nltk.downloader stopwords punkt --quiet 

# copy packages directory
!cp -r /content/drive/MyDrive/03-resources/python_pkgs/packages ./

# 2. Importing Libraries

In [None]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

import warnings
warnings.filterwarnings('ignore')

from wordcloud import WordCloud, STOPWORDS

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

import spacy

# Split Dataset and Standardize the Datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer


# Neural Network
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

from tensorflow.keras import Input
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import TextVectorization, Embedding
from tensorflow.keras.layers import InputLayer, Dense
from tensorflow.keras.layers import GlobalAveragePooling1D
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM, GRU

from transformers import pipeline
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Evaluate Classification Models
from sklearn.metrics import classification_report

# Useful functions
from packages.checker import check_missing
from packages.checker import check_links_only
from packages.outlier_handling import outlier_summary
from packages.visualization import kdeplot, plot_loss, plot_acc
from packages.imputation_handling import drop_title_links_only

from packages.text_preprocessing import combine_text
from packages.text_preprocessing import clean_text, strip_stopwords 
from packages.text_preprocessing import stem_text, lemmatize_text


pd.set_option('display.precision', 2)

sns.set_theme(style='darkgrid', palette='Set1')

# set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

%matplotlib inline

# 3. Data Loading

In [None]:
# load dataset
df_fake_ori = pd.read_csv("data/Fake.csv")
df_real_ori = pd.read_csv("data/True.csv")

# make a copy of the original dataframe
df_fake = df_fake_ori.copy()
df_real = df_real_ori.copy()

# display the first 5 entries of fake news data
df_fake.head()

In [None]:
# display the first 5 entries of real news data
df_real.head()

## Data Understanding

In [None]:
# check dataset shape
print(f"Fake news dataset shape: {df_fake.shape}")
print(f"Real news dataset shape: {df_real.shape}")

There are 21417 instances and 4 columns of real news data\
There are 23481 instances and 4 columns of fake news data

In [None]:
# check fake news dataset info
df_fake.info()

In [None]:
# check real news dataset info
df_real.info()

Both datasets store their dates as string objects. I'm going to convert them to datetime objects.\
However, I'm going to combine them first. Both dataframes have the same features, so we can safely proceed.

## Combine Dataset
Since the dataset is separated into real and fake news, let's combine them into one dataframe.

In [None]:
# define label, 0 for fake news, 1 for real news
df_fake['label'] = 0
df_real['label'] = 1

# concat datasets and reset index
df_ori = pd.concat([df_fake, df_real]).reset_index(drop=True)

# create backup
df = df_ori.copy()

# display the first five rows of the dataset
df.head()

In [None]:
# check dataset shape
df.shape

There are 44,898 instances of data with 5 columns

## Check Missing values and Duplicates

In [None]:
# check missing values in dataset
check_missing(df)

Great! There are no missing values

In [None]:
# check duplicate values in dataset
df[df.duplicated()]

We found some duplicates in our dataset. Let's drop them

In [None]:
# drop duplicates
df = df[~df.duplicated()]

In [None]:
df.shape

We now have 44,689 instances of data left
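
As a rough illustration of what `df.duplicated()` flags, the pure-Python sketch below marks every row that repeats an earlier one, so the first occurrence is kept. The helper name `mark_duplicates` and the sample rows are made up for this example, and pandas implements the check differently under the hood.

```python
def mark_duplicates(rows):
    """Return a list of booleans, True where a row repeats an earlier row."""
    seen = set()
    flags = []
    for row in rows:
        key = tuple(row)          # rows must be hashable to be remembered
        flags.append(key in seen)
        seen.add(key)
    return flags


# hypothetical (title, label) rows, not taken from the dataset
rows = [
    ("Title A", 0),
    ("Title B", 1),
    ("Title A", 0),   # repeats the first row
]

flags = mark_duplicates(rows)
deduplicated = [row for row, is_dup in zip(rows, flags) if not is_dup]
print(flags)
print(deduplicated)
```

Filtering with the negated flags is the same idea as `df[~df.duplicated()]` above.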

## Check for Dataset Imbalance

Check whether the labels of the dataset are balanced

In [None]:
# check for imbalance in label
plt.figure(figsize=(4,5))
sns.countplot(data=df, x='label')
plt.title('Number of Fake VS Real News')
plt.xlabel(None)
plt.ylabel(None)
plt.ylim(0, df.shape[0] / 1.5)
plt.xticks([0, 1], ['Fake', 'Real'])

plt.show()

We can see that our data has a similar number of fake and real news articles. Hence, the data is quite balanced and we won't need to oversample or undersample.
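
To put a number on "quite balanced", here is a small sketch that turns class counts into shares. It uses the counts reported by the shape check earlier (before duplicates were dropped), so the shares after deduplication may differ slightly. The helper `class_shares` is hypothetical.

```python
def class_shares(counts):
    """Convert a {label: count} mapping into {label: share of total}."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# counts reported by the shape check, before duplicates were dropped
counts = {"Fake": 23481, "Real": 21417}

shares = class_shares(counts)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

A split of roughly 52% to 48% is close enough to even that resampling is not needed.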

## Splitting Dataset

We need to split the dataset into train and test sets before we do any EDA.\
We do our EDA on the train set only, so that our decisions are not biased by data the model is not supposed to see.
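
The cells that follow split twice with `test_size=0.20`, which leaves roughly 64% of the rows for training, 16% for validation, and 20% for testing. The sketch below only does the arithmetic. Rounding the held-out share up is an assumption of this sketch, so the shapes printed by the notebook are the authoritative numbers.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the held-out share up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


n_total = 44_689                      # rows left after dropping duplicates
n_train_valid, n_test = split_sizes(n_total, 0.20)
n_train, n_valid = split_sizes(n_train_valid, 0.20)

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.0%})")
```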

### Split train and test set

In [None]:
# split sets to training+validation and testing sets
df_train_valid, df_test = train_test_split(
    df,
    test_size=0.20,
    random_state=42,
    stratify=df['label']
)

print('df_train_valid Size:', df_train_valid.shape)
print('df_test Size:', df_test.shape)

### Split train and validation set

In [None]:
# split sets to training and validation sets
df_train, df_valid = train_test_split(
    df_train_valid,
    test_size=0.20,
    random_state=42,
    stratify=df_train_valid['label']
)

print('df_train Size:', df_train.shape)
print('df_valid Size:', df_valid.shape)

In [None]:
# print datasets shape
print(f'df_train shape: {df_train.shape}')
print(f'df_valid shape: {df_valid.shape}')
print(f'df_test shape: {df_test.shape}')

In [None]:
# backup the train set that we are gonna perform EDA on
df_train_ori = df_train.copy()

# 4. Exploratory Data Analysis

## Subjects

In [None]:
# plot number of news according to subjects
plt.figure(figsize=(15, 6))
sns.countplot(
    data=df_train,
    x='subject',
    hue='label',
    order=df_train['subject'].value_counts().index,
)
plt.title('No. of news according to subjects grouped by fake status')
plt.xlabel(None)
plt.ylabel('No. of news')

plt.legend(labels=['Fake', 'Real'])
plt.show()

We can see that real and fake news have totally different subjects. This might be a giveaway if we were to include this feature in our machine learning model later on

## Date

In [None]:
# attempt to convert date to datetime object to analyze time intervals
try:
    pd.to_datetime(df_train['date'])
except Exception as e:
    print(e)

That's weird. Why would there be a string in a date feature? Let alone a link.\
Let's check for any links in the date feature

In [None]:
# check for dates which contains link
df_train[df_train['date'].str.contains('http')]

There are 6 of them, and it's not just the date feature. The same links can also be found in the title and text features. These could be treated as missing values since they don't contain actual news.
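
The failure above is what happens whenever a parser meets a value that is not a date. As a minimal sketch, the helper below returns `None` instead of raising. The format string `%B %d, %Y` and both sample values are assumptions made for this example, not values read from the dataset.

```python
from datetime import datetime


def try_parse_date(value, fmt="%B %d, %Y"):
    """Return a datetime if `value` matches `fmt`, otherwise None."""
    try:
        return datetime.strptime(value.strip(), fmt)
    except ValueError:
        return None


# hypothetical values: one well-formed date and one link
samples = ["December 31, 2017", "https://example.com/some-article"]

for value in samples:
    parsed = try_parse_date(value)
    status = "ok" if parsed else "not a date"
    print(f"{value!r}: {status}")
```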

## Corpus

In [None]:
# display the first instance of the dataset
df_train.iloc[0]

In [None]:
# display the title of the first instance 
df_train.iloc[0]['title']

In [None]:
# display the text of the first instance
df_train.iloc[0]['text']

It seems that `title` and `text` are the two features that carry the most important content of the news. Since NLP is performed on a body of text, I'm only going to use one feature, so I'm going to combine these two features into one later on.

## Wordcloud

In [None]:
# create word cloud object
wc = WordCloud(
    max_words=2000,
    stopwords=STOPWORDS
)

In [None]:
# generate word cloud
wc.generate(' '.join(df_train['text']))

# plot wordcloud
plt.figure(figsize=(18, 12))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

We can see that `Donald Trump` and `White House` show up quite often.\
Let's separate between the fake and real news and see how they differ

In [None]:
# generate word cloud for fake news
wc.generate(' '.join(df_train[df_train['label'] == 0]['text']))

# plot wordcloud
plt.figure(figsize=(18, 12))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

For fake news, there's an additional word, `said`, that popped out. This is interesting.

In [None]:
# generate word cloud for real news
wc.generate(' '.join(df_train[df_train['label'] == 1]['text']))

# plot wordcloud
plt.figure(figsize=(18, 12))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

For real news, the words are pretty similar, like `said`, `United State`, and also `Donald Trump`
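
A word cloud is essentially a picture of word frequencies after stopwords are removed. The sketch below counts words the same way in plain Python, which makes it easier to compare the two classes with numbers. The snippets, the tiny stopword set, and the helper `top_words` are all made up for illustration.

```python
import re
from collections import Counter


def top_words(texts, stopwords, n=3):
    """Count lowercase word tokens across `texts`, skipping stopwords."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# hypothetical snippets and a tiny stopword set, for illustration only
texts = [
    "The president said the bill would pass.",
    "The senator said the president was wrong.",
]
stopwords = {"the", "was", "would"}

print(top_words(texts, stopwords))
```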

## Number of Words

In [None]:
# create a feature to store the number of words in the text feature
df_train['words'] = df_train['text'].str.split().apply(len)

In [None]:
# plot number of words
plt.figure(figsize=(6, 10))
sns.boxplot(data=df_train, x='label', y='words', showfliers=False)
plt.title('Words per News')
plt.xlabel(None)
plt.ylabel('No. of words')
plt.xticks([0, 1], ['Fake', 'Real'])

plt.show()

We can see that both fake and real news have about 400 words in their body of text. However, the first quartile for real news is lower than that of fake news.\
The outliers were hidden in that plot, so let's show them now.

In [None]:
# plot number of words
plt.figure(figsize=(6, 10))
sns.boxplot(data=df_train, x='label', y='words', showfliers=True)
plt.title('Words per News')
plt.xlabel(None)
plt.ylabel('No. of words')
plt.xticks([0, 1], ['Fake', 'Real'])

plt.show()

It's now a totally different story. We can see that a few fake news articles have a lot more words compared to real news.
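
The points that appear once `showfliers=True` is set are the values beyond the whiskers, which by default sit 1.5 times the interquartile range away from the box. Here is a minimal sketch of that rule, using hypothetical word counts and a made-up helper `box_stats`.

```python
from statistics import quantiles


def box_stats(values):
    """Return quartiles and the 1.5 * IQR fences used to flag outliers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


# hypothetical word counts with one very long article
word_counts = [100, 200, 300, 400, 500, 600, 700, 800, 9000]

stats = box_stats(word_counts)
outliers = [
    v for v in word_counts
    if v < stats["lower_fence"] or v > stats["upper_fence"]
]
print(stats)
print(outliers)
```

A single very long article barely moves the quartiles, which is why the two plots above tell such different stories.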

## Unique Words

In [None]:
# create a feature to store the number of unique words in the text feature
df_train['unique_words'] = df_train['text'].str.split().apply(np.unique).apply(len)

In [None]:
# plot number of unique words
plt.figure(figsize=(6, 8))
sns.barplot(data=df_train, x='label', y='unique_words', ci=None)
plt.title('Unique Words per News')
plt.xlabel(None)
plt.ylabel('No. of words')
plt.xticks([0, 1], ['Fake', 'Real'])

plt.show()

The number of unique words in fake news is slightly higher than in real news
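
For reference, both word-based features boil down to a whitespace split. The sketch below shows the total and the distinct count side by side. Note that the split is case-sensitive, so `The` and `the` count as two different words. The helper `word_stats` and the sample sentence are hypothetical.

```python
def word_stats(text):
    """Return (total words, distinct words) for a whitespace-split text."""
    words = text.split()
    return len(words), len(set(words))


# hypothetical sentence; "The" and "the" count as different words
sample = "The cat saw the other cat"
print(word_stats(sample))
```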

# 5. Data Preprocessing

In [None]:
# restore the train set from the backup
df_train = df_train_ori.copy()

In [None]:
# split between features and label
X_train = df_train.drop(['label'], axis=1)
y_train = df_train['label'].copy()

X_valid = df_valid.drop(['label'], axis=1)
y_valid = df_valid['label'].copy()

X_test = df_test.drop(['label'], axis=1)
y_test = df_test['label'].copy()

## Handling Missing Values

We don't have any NaN missing values. However, we found during our EDA that some instances in our dataset contain nothing but links. Therefore, we are going to drop these entries, as they do not provide any value.

In [None]:
# check links only value in train set
check_links_only(X_train)

In [None]:
# check links only value in validation set
check_links_only(X_valid)

In [None]:
# check links only value in test set
check_links_only(X_test)

It seems that for some instances the title is not a link, but the text contains only a link. Let's check these rows.

In [None]:
# display the first five rows of data which have normal titles, but link in text
X_train[
    ~(X_train['title'].str.contains(r'^http\S+$', regex=True)) &
    (X_train['text'].str.contains(r'^http\S+$', regex=True))
].head()

It seems that the links in the text point to different websites.\
I'm not going to drop these instances, since the values in the title feature could still be useful as predictors.\
Therefore, I'm only dropping instances whose title is a link.
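
The rule described above can be sketched with the same regular expression used in the previous cell. The real `drop_title_links_only` helper lives in the `packages` module and is not shown here, so the function below is only a hypothetical stand-in that mirrors the described behaviour on made-up rows.

```python
import re

LINK_ONLY = re.compile(r"^http\S+$")


def is_link_only(value):
    """True when the whole value is a single link with no other text."""
    return LINK_ONLY.search(value) is not None


# hypothetical rows: (title, text)
rows = [
    ("https://example.com/a", "https://example.com/a"),   # dropped
    ("Some headline", "https://example.com/b"),           # kept: title is usable
    ("Another headline", "Body of the article."),         # kept
]

kept = [(title, text) for title, text in rows if not is_link_only(title)]
print(len(kept))
```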

In [None]:
# list of features that we want to impute
impute_cols = ['title', 'date']

In [None]:
# print dataset shape before handling links in title
print('X_train and y_train shape before handling links in title:', X_train.shape, y_train.shape)
print('X_valid and y_valid shape before handling links in title:', X_valid.shape, y_valid.shape)
print('X_test and y_test shape before handling links in title:', X_test.shape, y_test.shape)

print('=' * 80)

# drop instances of data which has link as its title
X_train, y_train_final = drop_title_links_only(X_train, y_train, impute_cols)
X_valid, y_valid_final = drop_title_links_only(X_valid, y_valid, impute_cols)
X_test, y_test_final = drop_title_links_only(X_test, y_test, impute_cols)

# print dataset shape after handling links in title
print('X_train and y_train_final shape after handling links in title:', X_train.shape, y_train_final.shape)
print('X_valid and y_valid_final shape after handling links in title:', X_valid.shape, y_valid_final.shape)
print('X_test and y_test_final shape after handling links in title:', X_test.shape, y_test_final.shape)


Great! We have no more missing values

## Feature Selection

In [None]:
# display the first five rows of the train set
X_train.head()

- We mentioned that we are going to combine the `title` and `text` features, since together they make up the major part of a news article
- We're going to drop `subject`, since we found during our EDA that fake and real news have totally different subjects
- We're also going to drop `date`, since the date does not carry any information for an NLP model. We're not attempting to find a pattern in when fake news might be released. We're attempting to spot fake news based on its content
- Therefore, our dataset will only contain a single feature, which holds the news built from `title` and `text`
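
Combining the two features amounts to joining the strings. The `combine_text` helper used below comes from the `packages` module and its implementation is not shown, so the sketch here is a hypothetical stand-in (`combine_fields`) that assumes a plain space-separated join.

```python
def combine_fields(record, new_key, *keys):
    """Return a copy of `record` with `keys` joined by spaces under `new_key`."""
    combined = dict(record)
    combined[new_key] = " ".join(str(record[k]).strip() for k in keys)
    return combined


# hypothetical record
record = {"title": "Some headline ", "text": "Body of the article.", "subject": "News"}

news = combine_fields(record, "news", "title", "text")["news"]
print(news)
```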

In [None]:
# combine title and text features as news
X_train_combined = combine_text(X_train, 'news', 'title', 'text')
X_valid_combined = combine_text(X_valid, 'news', 'title', 'text')
X_test_combined = combine_text(X_test, 'news', 'title', 'text')

# keep only the news feature as the only text to process
X_train_combined = X_train_combined['news']
X_valid_combined = X_valid_combined['news']
X_test_combined = X_test_combined['news']

# print datasets shape
print(f'X_train_combined shape: {X_train_combined.shape}')
print(f'X_valid_combined shape: {X_valid_combined.shape}')
print(f'X_test_combined shape: {X_test_combined.shape}')

## Text Preprocessing

In [None]:
# display the first five rows of the train set
X_train_combined.head()

In [None]:
%%time

# clean text
X_train_cleaned = X_train_combined.apply(clean_text)
X_valid_cleaned = X_valid_combined.apply(clean_text)
X_test_cleaned = X_test_combined.apply(clean_text)

# lemmatize text
X_train_lemmatized = X_train_cleaned.apply(lemmatize_text)
X_valid_lemmatized = X_valid_cleaned.apply(lemmatize_text)
X_test_lemmatized = X_test_cleaned.apply(lemmatize_text)

# # stem text
# X_train_stemmed = X_train_lemmatized.apply(stem_text)
# X_valid_stemmed = X_valid_lemmatized.apply(stem_text)
# X_test_stemmed = X_test_lemmatized.apply(stem_text)

## Create TensorFlow input pipelines

In [None]:
# define final dataset
X_train_final = X_train_lemmatized
X_valid_final = X_valid_lemmatized
X_test_final = X_test_lemmatized

In [None]:
# define batch size
batch_size = 32

# create tf dataset instance 
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_final, y_train_final)).batch(batch_size).cache()
valid_dataset = tf.data.Dataset.from_tensor_slices((X_valid_final, y_valid_final)).batch(batch_size).cache()
test_dataset = tf.data.Dataset.from_tensor_slices((X_test_final, y_test_final)).batch(batch_size).cache()

## Tokenization
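
The cells in this section derive two numbers for the vectorization layer: the vocabulary size to keep (75% of the words seen in the train set) and the output sequence length (the mean document length). The sketch below reproduces that arithmetic on made-up documents. A plain whitespace split stands in for `CountVectorizer`'s tokenizer, which is an assumption of this sketch.

```python
import math


def vectorizer_limits(documents, pct_vocab=0.75):
    """Return (max_features, max_seq_length) for a list of cleaned documents."""
    vocabulary = {word for doc in documents for word in doc.split()}
    max_features = math.floor(pct_vocab * len(vocabulary))
    mean_length = sum(len(doc.split(" ")) for doc in documents) / len(documents)
    return max_features, math.floor(mean_length)


# hypothetical cleaned documents
documents = [
    "senate pass budget bill",
    "president sign budget bill today",
    "court block order",
]

print(vectorizer_limits(documents))
```

Using the mean length means longer articles get truncated and shorter ones get padded.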

In [None]:
# declare vectorizer object
Vectorize = CountVectorizer()

# fit vectorize object to the train set
Vectorize.fit(X_train_final)

In [None]:
# display the size of the vocabulary and its first ten entries
print(f'There are {len(Vectorize.vocabulary_)} words in the vocabulary')
print('These are the first ten:')
print(list(Vectorize.vocabulary_.keys())[:10])

In [None]:
# define percentage of vocab to include
pct_vocab = 0.75

# define max number of features to include
max_features = int(np.floor(pct_vocab * len(Vectorize.vocabulary_)))

# define max number of sequence length based on mean document length
max_seq_length = int(np.floor(np.mean([len(i.split(' ')) for i in X_train_final])))

# print max number of vocabs and output sequence length
print(f'Max number of vocabs: {max_features}')
print(f'Output sequence length: {max_seq_length}')

## TextVectorization
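
To make the layer below less of a black box, here is a pure-Python sketch of what integer-mode vectorization does: lowercase and strip punctuation, split on whitespace, map each word to an index (0 is reserved for padding and 1 for out-of-vocabulary words), then truncate or pad to a fixed length. This is a simplified illustration with made-up helper names and texts, not the Keras implementation.

```python
import string
from collections import Counter


def standardize(text):
    """Lowercase and strip ASCII punctuation, then split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is out-of-vocabulary, then words by frequency."""
    counts = Counter(token for text in texts for token in standardize(text))
    words = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index for index, word in enumerate(words, start=2)}


def vectorize(text, vocab, length):
    """Map tokens to integer ids, then truncate or zero-pad to `length`."""
    ids = [vocab.get(token, 1) for token in standardize(text)][:length]
    return ids + [0] * (length - len(ids))


# hypothetical training texts
train_texts = ["Trump said this.", "Trump said that!", "Trump left."]
vocab = build_vocab(train_texts, max_tokens=4)

print(vocab)
print(vectorize("Trump said something new", vocab, length=6))
```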

In [None]:
# define text vectorizer layer
text_vectorizer = TextVectorization(
    max_tokens=max_features,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    ngrams=None,
    output_mode='int',
    output_sequence_length=max_seq_length,
)

# adapt vectorization layer to the train set
text_vectorizer.adapt(X_train_final)

## Embedding

In [None]:
# define embedding layer
embedding_layer = Embedding(
    input_dim=max_features,
    output_dim=128,
    input_length=max_seq_length,
)

# 6. Model Definition

- Target: Predicting whether a news article is fake (`0`) or real (`1`)
- Predictors: The only feature I'm going to use is `news`, the combination of `title` and `text`
- Models: I'm going to use a DNN, an LSTM, and a GRU model

## DNN Model

Keeping this model as minimal as possible: an embedding layer followed by global average pooling and 1 hidden layer with 64 neurons
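
The core of this model is an embedding lookup followed by an average over the sequence axis, which turns a variable number of word vectors into one fixed-size vector per article. The sketch below shows both steps on a made-up 2-dimensional embedding table. In the real model the table is learned during training, and the values here are purely illustrative.

```python
def embed(token_ids, table):
    """Look up one embedding vector per token id."""
    return [table[token_id] for token_id in token_ids]


def global_average_pool(vectors):
    """Average the vectors over the sequence axis, one value per dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# hypothetical 2-dimensional embedding table for a 4-token vocabulary
table = [
    [0.0, 0.0],   # id 0: padding
    [1.0, 1.0],   # id 1: out-of-vocabulary
    [2.0, 4.0],   # id 2
    [4.0, 2.0],   # id 3
]

pooled = global_average_pool(embed([2, 3, 0, 0], table))
print(pooled)   # padding positions are averaged in as well
```

Because no mask is applied, padded positions take part in the average, so heavily padded articles are pulled towards the padding vector.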

In [None]:
# instantiate input object
inputs = Input(shape=(1,), dtype=tf.string)

# preprocess inputs
preprocessed_inputs = text_vectorizer(inputs)

# apply model layers
x = embedding_layer(preprocessed_inputs)
x = Dropout(rate=0.2)(x)
x = GlobalAveragePooling1D()(x)
x = Dropout(rate=0.2)(x)
x = Dense(units=64, activation='relu')(x)
x = Dropout(rate=0.2)(x)
outputs = Dense(units=1, activation='sigmoid')(x)
model_dnn = Model(inputs=inputs, outputs=outputs)

# compile model
model_dnn.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=['accuracy']
)

# display model summary for functional model
model_dnn.summary()

In [None]:
# plot model architecture
keras.utils.plot_model(model_dnn, show_shapes=True)

## LSTM Model

Replacing the pooling and dense layers with a single LSTM layer of 64 units, with dropout before and after it

In [None]:
# instantiate input object
inputs = Input(shape=(1,), dtype=tf.string)

# preprocess inputs
preprocessed_inputs = text_vectorizer(inputs)

# apply model layers
x = embedding_layer(preprocessed_inputs)
x = Dropout(rate=0.2)(x)
x = LSTM(units=64, activation='tanh')(x)
x = Dropout(rate=0.2)(x)
outputs = Dense(units=1, activation='sigmoid')(x)
model_lstm = Model(inputs=inputs, outputs=outputs)

# compile model
model_lstm.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=['accuracy']
)

# display model summary for functional model
model_lstm.summary()

In [None]:
# plot model architecture
keras.utils.plot_model(model_lstm, show_shapes=True)

## GRU Model

Same architecture as the LSTM model, but with a GRU layer of 64 units in place of the LSTM layer

In [None]:
# instantiate input object
inputs = Input(shape=(1,), dtype=tf.string)

# preprocess inputs
preprocessed_inputs = text_vectorizer(inputs)

# apply model layers
x = embedding_layer(preprocessed_inputs)
x = Dropout(rate=0.2)(x)
x = GRU(units=64, activation='tanh')(x)
x = Dropout(rate=0.2)(x)
outputs = Dense(units=1, activation='sigmoid')(x)
model_gru = Model(inputs=inputs, outputs=outputs)

# compile model
model_gru.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=['accuracy']
)

# display model summary for functional model
model_gru.summary()

In [None]:
# plot model architecture
keras.utils.plot_model(model_gru, show_shapes=True)

## Callbacks objects
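
Early stopping with `patience=5` means training halts once the monitored loss has failed to improve for five epochs in a row, and `restore_best_weights=True` rolls the model back to the best epoch. The plain-Python sketch below traces that logic over a made-up list of validation losses. It is a simplified illustration, not the Keras callback itself.

```python
def early_stopping(val_losses, patience):
    """Return (stopped_epoch, best_epoch) for a list of per-epoch losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# hypothetical validation losses, one per epoch
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38, 0.39, 0.34]

stopped_epoch, best_epoch = early_stopping(val_losses, patience=5)
print(stopped_epoch, best_epoch)
```

In this example training stops before the final, lower loss is ever reached, which is the trade-off of a finite patience.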

In [None]:
# define callbacks
callbacks = [
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    ModelCheckpoint(filepath='models/checkpoint', monitor='val_loss', save_best_only=True, verbose=0)
]

# 7. Model Training

In [None]:
# create dictionary of models
models = {
    'dnn': model_dnn,
    'lstm': model_lstm,
    'gru': model_gru,
}

In [None]:
%%time

# create dictionary to store metrics
metrics = {}

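# loop through models and train
for name, model in models.items():
    # use fresh callbacks per model so that the best-loss state and the
    # checkpoint file of one model do not carry over to the next
    model_callbacks = [
        EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
        ModelCheckpoint(filepath=f'models/checkpoint_{name}', monitor='val_loss', save_best_only=True, verbose=0)
    ]

    # train model
    history = model.fit(
        train_dataset,
        epochs=30,
        validation_data=valid_dataset,
        callbacks=model_callbacks,
        verbose=0
    )

    # store metrics
    metrics[name] = pd.DataFrame(history.history)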

# 8. Model Evaluation

In [None]:
# create dictionary to store evaluation metrics
eval_metrics = {
    'dnn': {},
    'lstm': {},
    'gru': {},
}

# loop through models and evaluate them
for name, model in models.items():
    # evaluate model
    eval_metrics[name]['loss'], eval_metrics[name]['accuracy'] = model.evaluate(
        test_dataset,
        verbose=0
    )

# create dataframe from evaluation metrics
eval_metrics_df = pd.DataFrame(eval_metrics).T

# display evaluation metrics
eval_metrics_df

We can see that the loss actually increased after being tuned. The accuracy also slightly decreased

In [None]:
plt.figure(figsize=(18, 12))

# plot the loss curves
for i, (name, metric) in enumerate(metrics.items()):
    plt.subplot(2, 2, i + 1)
    plot_loss(metric)
    plt.title(f'Training and validation loss for {name}')

plt.tight_layout()
plt.show()

- Even though the sequential and functional models use the same hyperparameters, the results are actually slightly different
- We can see from the graph that the functional model is slightly overfitted, as the gap widens with each epoch.
- The sequential model also shows a widening gap, albeit more subtly.
- The tuned models, be it sequential or functional, are now a much better fit; we could even call it a good fit. The validation loss is also slightly lower than the training loss.
- We do have to note though that the absolute value of the loss increased after tuning. So we still have to see how it performs when we use it to predict the test set

In [None]:
plt.figure(figsize=(18, 12))

# plot the accuracy curves
for i, (name, metric) in enumerate(metrics.items()):
    plt.subplot(2, 2, i + 1)
    plot_acc(metric)
    plt.title(f'Training and validation accuracy for {name}')

plt.tight_layout()
plt.show()

- We can see that the accuracy also decreased somewhat after tuning
- The validation accuracy for the tuned models is quite high from the start, but it didn't increase much and stabilized later on

## Prepare Evaluation
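
The cells in this section turn sigmoid outputs into class labels with a `0.5` threshold and then summarise them with `classification_report`. As a minimal sketch of what goes into that report, the code below thresholds made-up probabilities and computes precision and recall for one class by hand. The helper names and the sample values are hypothetical.

```python
def binarize(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 labels; values above the threshold become 1."""
    return [1 if p > threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# hypothetical labels (0 = fake, 1 = real) and model outputs
y_true = [0, 0, 1, 1, 1]
y_prob = [0.1, 0.7, 0.8, 0.4, 0.9]

y_pred = binarize(y_prob)
precision, recall = precision_recall(y_true, y_pred, positive=1)
print(y_pred)
print(precision, recall)
```

Passing `positive=0` gives the same metrics for the fake class, which matters when deciding which kind of mistake is more costly.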

In [None]:
# set threshold for prediction
threshold = 0.5

# create dictionary to store predictions:
predictions = {
    'dnn': {},
    'lstm': {},
    'gru': {},
}

# loop through models and make predictions
for name, model in models.items():
    # make predictions for training set
    pred_train = model.predict(X_train_final).reshape(-1)
    pred_train = np.where(pred_train > threshold, 1, 0)

    # make predictions for test set
    pred_test = model.predict(X_test_final).reshape(-1)
    pred_test = np.where(pred_test > threshold, 1, 0)

    # store predictions in dictionary
    predictions[name]['train'] = pred_train
    predictions[name]['test'] = pred_test

In [None]:
# prepare target names for classification report
target_names = ["Fake", "Real"]

## DNN Evaluation

In [None]:
# print classification report for dnn model
for name, preds in predictions.items():
    if 'dnn' not in name:
        continue
    for dataset, pred in preds.items():
        if dataset == 'train':
            print(f'{name} classification report for training set:')
            print(classification_report(y_train_final, pred, target_names=target_names))
        if dataset == 'test':
            print(f'{name} classification report for testing set:')
            print(classification_report(y_test_final, pred, target_names=target_names))

- We can now see clearly that neither model is actually a very good fit. I don't think they can be called overfit either, though. So I'd call it a decent fit, but it could certainly be improved, as there is a `0.04` gap between training accuracy and testing accuracy
- We can also see that tuning the model successfully increased the recall score. This is important to us as we want to minimize **False Negatives**, since we need to detect potential of churning as much as possible.
- The tuned recall score is `0.79` which is quite decent
- Therefore, our model could be run on inference, but we should also strive to improve this model further

## LSTM Evaluation

In [None]:
# print classification report for lstm model
for name, preds in predictions.items():
    if 'lstm' not in name:
        continue
    for dataset, pred in preds.items():
        if dataset == 'train':
            print(f'{name} classification report for training set:')
            print(classification_report(y_train_final, pred, target_names=target_names))
        if dataset == 'test':
            print(f'{name} classification report for testing set:')
            print(classification_report(y_test_final, pred, target_names=target_names))

- The fit is quite similar to that of the sequential models. However, the gap between training accuracy and testing accuracy in the tuned model actually became even wider, at `0.07`
- Tuning the model increased the recall score substantially to `0.82`, which is good
- So this model is also good, but I'm quite concerned with the wider gap indicating overfit.

## GRU Evaluation

In [None]:
# print classification report for gru model
for name, preds in predictions.items():
    if 'gru' not in name:
        continue
    for dataset, pred in preds.items():
        if dataset == 'train':
            print(f'{name} classification report for training set:')
            print(classification_report(y_train_final, pred, target_names=target_names))
        if dataset == 'test':
            print(f'{name} classification report for testing set:')
            print(classification_report(y_test_final, pred, target_names=target_names))

- pass


## Analysis
- Tuning successfully increased the recall score
- The fit on the functional_tuned model actually became wider, indicating that the fit became a tad worse
- Therefore, considering the fit and also the recall score, I'm choosing the **LSTM** model (`model_lstm`) as the best model to be saved and run

## Save The Final Model

In [None]:
# prepare directory for saving model
model_dir = 'models'
model_name = 'nlp_model'

# create directory if it does not exist
Path(model_dir).mkdir(parents=True, exist_ok=True)

model_path = Path(model_dir, model_name)

# save model
model_lstm.save(model_path)

# 9. Model Inference

## Load The Model

In [None]:
# model location
model_dir = 'models'
model_name = 'nlp_model'

# create path object
model_path = Path(model_dir, model_name)

# load model
model = keras.models.load_model(model_path)

## Prepare Data For Inferencing

In [None]:
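# prepare data for inferencing
# no separate inference data is provided here, so as a stand-in
# we take a few rows from the held-out test features
new_data = X_test.sample(n=5, random_state=42)

# create dataframe for inferencing
new_data = pd.DataFrame(new_data)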

In [None]:
# display dataframe for inferencing
new_data

## Inferencing

In [None]:
# combine title and text features as news
new_data_prepared = combine_text(new_data, 'news', 'title', 'text')['news']

# clean and lemmatize text, following the same steps as the train set
new_data_prepared = new_data_prepared.apply(clean_text).apply(lemmatize_text)

# print shape of prepared data
print(new_data_prepared.shape)

In [None]:
%%time

# set threshold for prediction
threshold = 0.5

# the text vectorizer is part of the model, so no scaling or encoding is needed
new_data_final = new_data_prepared

# predict inference set using the final model
y_pred_new = model.predict(new_data_final).reshape(-1)
y_pred_new = np.where(y_pred_new > threshold, 1, 0)

In [None]:
# create dataframe with predictions
new_data['pred'] = y_pred_new

# display inference set
new_data

The model ran successfully on the inference dataset

# 10. Conclusion

## On EDA
- pass

## On Modeling
- pass

## Implication
- pass

## Future Improvement
- pass