# Covid19 Tweet Truth Analysis

Reference:
https://www.analyticsvidhya.com/blog/2021/09/creating-a-movie-reviews-classifier-using-tf-idf-in-python/

https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794

In [1]:
# setup CUDA
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB


# Data preparation

This dataset contains training, validation, and test CSV files, along with Excel copies of the train and test files, a CSV with the true test labels, and ERNIE test results. For this analysis, I ignore the Excel files (they duplicate the CSVs) and the ERNIE results. I also act as if the test answer file did not exist during the testing phase, sticking to a basic approach: train, validate, then see what the model predicts for the test set.

In [2]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import nltk #Natural Language Toolkit for Processing
from nltk.corpus import stopwords #Get the Stopwords to Remove

import re #Regular Expressions
import html #Messing with HTML content, like &amp;
import string #String Processing

import tensorflow as tf #Import tensorflow in order to use Keras
from tensorflow.keras.preprocessing.text import Tokenizer #Add the keras tokenizer for tweet tokenization
from tensorflow.keras.preprocessing.sequence import pad_sequences #Add padding to help the Keras Sequencing
import tensorflow.keras.layers as L #Import the layers as L for quicker typing
from tensorflow.keras.optimizers import Adam #Pull the adam optimizer for usage

from tensorflow.keras.losses import SparseCategoricalCrossentropy #Loss function being used
from sklearn.model_selection import train_test_split #Train Test Split

In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [5]:
import os

path = "/content/gdrive/My Drive/PLP_sharing/project/fake_news"
os.chdir(path)

In [6]:
os.listdir('./')

['tfidf',
 'covid19-fake-news-dataset-nlp.zip',
 'covid19-fake-news-dataset-nlp-unzip',
 'BiLSTM.ipynb',
 'bert',
 'Transformer-Explainability',
 'BERT_explainability.ipynb',
 'lqq_transformer1',
 'lqq_transformer2',
 'w2c-glove',
 'Fake Detection.gdoc',
 'late_fusion',
 'tfidf_nb.pkl']

In [None]:
# !unzip covid19-fake-news-dataset-nlp.zip -d covid19-fake-news-dataset-nlp-unzip

In [7]:
twTrain = pd.read_csv("./covid19-fake-news-dataset-nlp-unzip/Constraint_Train.csv") #Load the tweet (tw) training set
twTrain.head() #Take a peek at the data

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5,Populous states can generate large case counts...,real


In [8]:
twValid = pd.read_csv("./covid19-fake-news-dataset-nlp-unzip/Constraint_Val.csv") #Load the tweet (tw) validation set
twValid.head() #Take a peek at the data

Unnamed: 0,id,tweet,label
0,1,Chinese converting to Islam after realising th...,fake
1,2,11 out of 13 people (from the Diamond Princess...,fake
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real


---

# Check for Null Values

In [None]:
print("Training Set:\n", twTrain.isnull().any()) #Check for null values in the training set
print("Validation Set:\n", twValid.isnull().any()) #Check for null values in the validation set
print("Testing Set:\n", twTest.isnull().any()) #Check for null values in the testing set

Training Set:
 id       False
tweet    False
label    False
dtype: bool
Validation Set:
 id       False
tweet    False
label    False
dtype: bool
Testing Set:
 id       False
tweet    False
dtype: bool


There are no null values in the dataset.

---

# Data Exploration

In [None]:
print(twTrain["tweet"][0]) #Print a simple tweet example
print(twTrain["tweet"][300]) #Print a more typical tweet example

The CDC currently reports 99031 deaths. In general the discrepancies in death counts between different sources are small and explicable. The death toll stands at roughly 100000 people today.
NEW: There have been numerous #COVID19 outbreaks on recent cruise ship voyages. @CDCDirector has extended the previous No Sail Order to prevent the spread of COVID-19 among crew onboard. https://t.co/OTWJgCN8wQ https://t.co/sbHX4p907F


The dataset mixes plain, report-style tweets with more typical tweets that contain hashtags, mentions, and links. Because the typical kind is present, I will have to do the usual tweet cleaning.

In [None]:
print("Training Labels:\n", twTrain["label"].value_counts()) #See the training labels
print("Validation Labels:\n", twValid["label"].value_counts()) #See the validation labels

Training Labels:
 real    3360
fake    3060
Name: label, dtype: int64
Validation Labels:
 real    1120
fake    1020
Name: label, dtype: int64


The labels are close to balanced in number. I will still need to encode them so that real and fake become 1 and 0, but the balance means the model should pick up on both classes without too much difficulty.
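
To put a number on the balance, here is a minimal sketch that works only from the counts printed above. The helper name `class_shares` is my own, and the majority-class baseline is just a reference point for the accuracy figures later on, not something the notebook computes.

```python
# Label counts copied from the value_counts() output above.
train_counts = {"real": 3360, "fake": 3060}
valid_counts = {"real": 1120, "fake": 1020}


def class_shares(counts):
    """Return each label's share of the total as a fraction."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_shares = class_shares(train_counts)
valid_shares = class_shares(valid_counts)

# Accuracy of a classifier that always predicts the most common label.
majority_baseline = max(valid_shares.values())

print({k: round(v, 3) for k, v in train_shares.items()})  # {'real': 0.523, 'fake': 0.477}
print(round(majority_baseline, 3))  # 0.523
```

Any model worth keeping should land well above that baseline on the validation set.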

---

# Tweet Processing

In [9]:
punctuations = string.punctuation #List of punctuations to remove
print(punctuations) #See the punctuations the string library has

STOP = stopwords.words("english") #Get the NLTK stopwords
print(STOP) #See what NLTK considers stopwords

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only

In [10]:
#CleanTweets: parses the tweets and removes HTML entities, @mentions, links, punctuation, and stop words (digits are kept).
#Input: the list of tweets that need parsing
#Output: the parsed tweets
def cleanTweets(tweetParse):
    for i in range(0,len(tweetParse)):
        tweet = tweetParse[i] #Putting the tweet into a variable so that it is not calling tweetParse[i] over and over
        tweet = html.unescape(tweet) #Removes leftover HTML elements, such as &amp;
        tweet = re.sub(r"@\w+", " ", tweet) #Completely removes @'s, as other peoples' usernames mean nothing
        tweet = re.sub(r"http\S+", " ", tweet) #Removes links, as links provide no data in tweet analysis in themselves
        
        tweet = "".join([punc for punc in tweet if not punc in punctuations]) #Removes the punctuation defined above
        tweet = tweet.lower() #Turning the tweets lowercase real quick for later use
    
        tweetWord = tweet.split() #Splits the tweet into individual words
        tweetParse[i] = "".join([word + " " for word in tweetWord if not word in STOP]) #Checks if the words are stop words
        
    return tweetParse #Returns the parsed tweets

This code is reworked from my original coronavirus tweet sentiment analysis from earlier in the pandemic (https://www.kaggle.com/lunamcbride24/coronavirus-tweet-processing). I have changed it to use NLTK instead of spaCy, since NLTK's stopwords do not require building a spaCy model. I also use the string library to get punctuation instead of a bulky hard-coded list, and I removed the number remover, as I feel that numbers may be a key factor here (especially for the name Covid-19, which would otherwise lose the 19 and become just covid, which has a different connotation). These were things I wanted to change about the original after playing with Keras for TripAdvisor reviews (https://www.kaggle.com/lunamcbride24/hotel-review-keras-classification-project).

This may be a note to myself, but I did both of those projects half a year ago. This is why you should keep your code well-commented.
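
To show why I kept the digits, here is a small standard-library sketch of the same cleaning steps applied to a single string. `clean_one`, `STOP_SUBSET`, and the sample tweet are made up for illustration; the notebook itself uses NLTK's full English stop word list, and this version joins words without the trailing space that `cleanTweets` leaves.

```python
import html
import re
import string

# Small illustrative subset of stop words (an assumption, not NLTK's full list).
STOP_SUBSET = {"is", "the", "for", "a", "of", "to", "has"}


def clean_one(tweet, drop_digits=False):
    """Clean a single tweet string, following the same steps as cleanTweets."""
    tweet = html.unescape(tweet)             # turn entities such as &amp; back into characters
    tweet = re.sub(r"@\w+", " ", tweet)      # drop @mentions
    tweet = re.sub(r"http\S+", " ", tweet)   # drop links
    tweet = "".join(ch for ch in tweet if ch not in string.punctuation)
    if drop_digits:
        tweet = re.sub(r"\d+", "", tweet)    # the step I chose to leave out
    words = tweet.lower().split()
    return " ".join(w for w in words if w not in STOP_SUBSET)


sample = "COVID-19 is the focus for @WHO https://t.co/abc"
print(clean_one(sample))                    # covid19 focus
print(clean_one(sample, drop_digits=True))  # covid focus
```

With the digit remover on, "covid19" collapses to "covid", which is exactly the loss of information I wanted to avoid.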

In [11]:
twTrain["cleanTweet"] = cleanTweets(twTrain["tweet"].copy()) #Clean the training tweets
twTrain.head() #Take a look at the dataset

Unnamed: 0,id,tweet,label,cleanTweet
0,1,The CDC currently reports 99031 deaths. In gen...,real,cdc currently reports 99031 deaths general dis...
1,2,States reported 1121 deaths a small rise from ...,real,states reported 1121 deaths small rise last tu...
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake,politically correct woman almost uses pandemic...
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real,indiafightscorona 1524 covid testing laborator...
4,5,Populous states can generate large case counts...,real,populous states generate large case counts loo...


In [12]:
twValid["cleanTweet"] = cleanTweets(twValid["tweet"].copy()) #Clean the validation tweets
twValid.head() #Take a peek at the dataset

Unnamed: 0,id,tweet,label,cleanTweet
0,1,Chinese converting to Islam after realising th...,fake,chinese converting islam realising muslim affe...
1,2,11 out of 13 people (from the Diamond Princess...,fake,11 13 people diamond princess cruise ship inti...
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake,covid19 caused bacterium virus treated aspirin
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake,mike pence rnc speech praises donald trump’s c...
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real,610 skys explains latest covid19 data governme...


---

# Check for Post-Processing Blank Tweets

In [15]:
print("Training: \n", twTrain.loc[twTrain["cleanTweet"] == ""]) #Check for Training Blank Tweets
print("Validation: \n", twValid.loc[twValid["cleanTweet"] == ""]) #Check for Validation Blank Tweets
print("Testing: \n", twTest.loc[twTest["cleanTweet"] == ""]) #Check for Testing Blank Tweets

Training: 
 Empty DataFrame
Columns: [id, tweet, label, cleanTweet]
Index: []
Validation: 
 Empty DataFrame
Columns: [id, tweet, label, cleanTweet]
Index: []
Testing: 
 Empty DataFrame
Columns: [id, tweet, cleanTweet]
Index: []


In [16]:
print(twTrain["tweet"][300]) #Print a more typical tweet example
print(twTrain["cleanTweet"][300]) #Print the tweet after processing to show link and stopword removal

NEW: There have been numerous #COVID19 outbreaks on recent cruise ship voyages. @CDCDirector has extended the previous No Sail Order to prevent the spread of COVID-19 among crew onboard. https://t.co/OTWJgCN8wQ https://t.co/sbHX4p907F
new numerous covid19 outbreaks recent cruise ship voyages extended previous sail order prevent spread covid19 among crew onboard 


There were no blank tweets created in any set. Tweets can become blank if they were just user names and links, so I just needed to make sure.
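
As a quick illustration of how a tweet can end up blank, here is a minimal sketch using only the standard library. The two sample tweets and the helper `strip_mentions_links_punct` are made up; it skips stop word removal, which can only make more tweets blank.

```python
import re
import string


def strip_mentions_links_punct(tweet):
    """Remove @mentions, links, and punctuation, then normalize whitespace."""
    tweet = re.sub(r"@\w+", " ", tweet)
    tweet = re.sub(r"http\S+", " ", tweet)
    tweet = "".join(ch for ch in tweet if ch not in string.punctuation)
    return " ".join(tweet.lower().split())


sample_tweets = ["@CDCgov https://t.co/abc123", "Wash your hands! @WHO"]
cleaned_samples = [strip_mentions_links_punct(t) for t in sample_tweets]

# Positions of tweets that have nothing left after cleaning.
blank_positions = [i for i, t in enumerate(cleaned_samples) if t == ""]

print(cleaned_samples)   # ['', 'wash your hands']
print(blank_positions)   # [0]
```

A blank row would turn into an all-zero TF-IDF vector, so it carries no signal for the classifier.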

---

# Label Encoding

Because this is a binary classification problem, the get_dummies function in pandas effectively produces label-encoded values. The real column it creates holds 1 for real and 0 for not real, which here necessarily means fake, so taking that column is the same as label encoding.

In [13]:
dummyTrain = pd.get_dummies(twTrain["label"]) #Get the dummies for the training set
print(dummyTrain) #Show the dummies

      fake  real
0        0     1
1        0     1
2        1     0
3        0     1
4        0     1
...    ...   ...
6415     1     0
6416     1     0
6417     1     0
6418     1     0
6419     0     1

[6420 rows x 2 columns]


That real column shows the encoded values for real vs fake. I will be taking the real column as the encoded values.
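
The same encoding can be written as an explicit mapping, which I sketch here in plain Python. `LABEL_TO_INT` is a name I made up; the five labels are the first five training labels shown earlier.

```python
# Explicit label-to-integer mapping: real -> 1, fake -> 0.
LABEL_TO_INT = {"fake": 0, "real": 1}

first_labels = ["real", "real", "fake", "real", "real"]
encoded = [LABEL_TO_INT[label] for label in first_labels]

print(encoded)  # [1, 1, 0, 1, 1]
```

On the DataFrame itself, the same dictionary could be passed to `Series.map`. One possible advantage is that the mapping does not depend on which labels happen to appear in a given split, whereas get_dummies only creates a column for labels it actually sees.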

In [14]:
twTrain["encodedLabel"] = dummyTrain["real"].astype('int') #Get the encoded labels from the "real" dummies
twTrain.head() #Take a peek at the data

Unnamed: 0,id,tweet,label,cleanTweet,encodedLabel
0,1,The CDC currently reports 99031 deaths. In gen...,real,cdc currently reports 99031 deaths general dis...,1
1,2,States reported 1121 deaths a small rise from ...,real,states reported 1121 deaths small rise last tu...,1
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake,politically correct woman almost uses pandemic...,0
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real,indiafightscorona 1524 covid testing laborator...,1
4,5,Populous states can generate large case counts...,real,populous states generate large case counts loo...,1


In [15]:
twValid["encodedLabel"] = pd.get_dummies(twValid["label"])["real"].astype('int') #Get the encoded labels for the validation set
twValid.head() #Take a peek at the data

Unnamed: 0,id,tweet,label,cleanTweet,encodedLabel
0,1,Chinese converting to Islam after realising th...,fake,chinese converting islam realising muslim affe...,0
1,2,11 out of 13 people (from the Diamond Princess...,fake,11 13 people diamond princess cruise ship inti...,0
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake,covid19 caused bacterium virus treated aspirin,0
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake,mike pence rnc speech praises donald trump’s c...,0
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real,610 skys explains latest covid19 data governme...,1


---

# TF-IDF

In [16]:
train_X = twTrain["cleanTweet"]   # cleaned tweet text
train_y = twTrain["encodedLabel"]   # encoded label (1 = real, 0 = fake)
test_X = twValid['cleanTweet']
test_y = twValid["encodedLabel"]

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Build a TF-IDF vectorizer limited to the 200 most frequent terms
tf_idf = TfidfVectorizer(max_features=200, use_idf=True)
# Learn the vocabulary and IDF weights from the training data, then transform it
X_train_tf = tf_idf.fit_transform(train_X)

In [18]:
print("n_samples: %d, n_features: %d" % X_train_tf.shape)

n_samples: 6420, n_features: 200


In [19]:
# Now, we transform the test data into TF-IDF matrix format.

#transforming test data into tf-idf matrix
X_test_tf = tf_idf.transform(test_X)

print("n_samples: %d, n_features: %d" % X_test_tf.shape)
print(X_test_tf.shape)

n_samples: 2140, n_features: 200
(2140, 200)
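
For reference, here is a pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three toy documents are made up, and I split on whitespace, whereas the real vectorizer uses a regex tokenizer that drops one-character tokens. With `max_features=200`, the vectorizer also keeps only the 200 terms with the highest frequency across the corpus.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = ["covid19 cases rise", "covid19 vaccine cures covid19", "vaccine news"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


matrix = [tfidf_row(doc) for doc in tokenized]

print(vocab)
print([round(x, 3) for x in matrix[0]])
```

The vocabulary and IDF weights come from the training tweets only; the validation tweets are just transformed with them, which is why `transform` rather than `fit_transform` is used on `test_X`.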


In [20]:
import pickle

# Save the fitted vectorizer; the context manager closes the file after writing
with open("./tfidf/tfidf.pickle", "wb") as f:
    pickle.dump(tf_idf, f)

# Naive Bayes Classifier

In [21]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

#naive bayes classifier
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tf, train_y)

#predicted y
y_pred = naive_bayes_classifier.predict(X_test_tf)

## Estimation

In [22]:
print(metrics.classification_report(test_y, y_pred, target_names=['Fake', 'Real']))

              precision    recall  f1-score   support

        Fake       0.92      0.87      0.90      1020
        Real       0.89      0.93      0.91      1120

    accuracy                           0.91      2140
   macro avg       0.91      0.90      0.91      2140
weighted avg       0.91      0.91      0.91      2140
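
To make the columns of the report concrete, here is a minimal sketch of how precision, recall, and F1 are computed for one class. The toy labels are made up and are not the notebook's predictions; `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(y_true, y_hat, positive):
    """Compute precision, recall, and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (made up for illustration; 0 = fake, 1 = real).
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 0]

print(precision_recall_f1(toy_true, toy_pred, positive=0))  # (0.75, 0.75, 0.75)
print(precision_recall_f1(toy_true, toy_pred, positive=1))  # (0.75, 0.75, 0.75)
```

In the report above, the Fake row treats 0 as the positive class and the Real row treats 1 as the positive class; support is the number of true examples of each class.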



## Save model

In [34]:
import pickle
# now you can save it to a file
with open('./tfidf/tfidf_nb.pkl', 'wb') as f:
    pickle.dump(naive_bayes_classifier, f)

# # and later you can load it
# with open('filename.pkl', 'rb') as f:
#     clf = pickle.load(f)

## Model Inference Demo

In [35]:
with open('./tfidf/tfidf.pickle', 'rb') as f:
    saved_tf_idf = pickle.load(f)

with open('./tfidf/tfidf_nb.pkl', 'rb') as f:
    saved_nb = pickle.load(f)

In [36]:
test_text_fake = 'Alfalfa is the only cure for COVID-19.'
test_text_real = '#IndiaFightsCorona India has one of the lowest #COVID19 mortality globally with less than 2% Case Fatality Rate. As a result of supervised home isolation &amp; effective clinical treatment many States/UTs have CFR lower than the national average. https://t.co/QLiK8YPP7E'

cleaned = cleanTweets([test_text_fake, test_text_real])

test_input = saved_tf_idf.transform(cleaned)
test_input.shape

(2, 10000)

In [37]:
#0= Fake news
#1= Real news
preds = saved_nb.predict(test_input)
for res in preds:
  if res==1:
    print("Real Covid News")
  elif res==0:
    print("Fake Covid News")

Fake Covid News
Real Covid News


# LSTM

In [44]:
X_train_tfidf = X_train_tf.toarray()

X_test_tfidf = X_test_tf.toarray()

In [45]:
train_y.head()

0    1
1    1
2    0
3    1
4    1
Name: encodedLabel, dtype: int64

In [83]:
# train
print(X_train_tfidf.shape, X_train_tfidf.dtype)
# x_train_lstm = np.array(X_train_tfidf)
x_train_lstm = X_train_tfidf.reshape(-1, 1, 200)
print(x_train_lstm.shape, x_train_lstm.dtype)

aa = X_train_tfidf[:2]
print(aa.shape, type(aa))
bb = aa.reshape(-1, 1, 200)
print(bb.shape)

y_lstm = np.array(train_y)


# test
# x_test_lstm = np.array(X_test_tfidf)
y_test_lstm = np.array(test_y)
x_test_lstm = X_test_tfidf.reshape(-1, 1, 200)
print(x_test_lstm.shape)

(6420, 200) float64
(6420, 1, 200) float64
(2, 200) <class 'numpy.ndarray'>
(2, 1, 200)
(2140, 1, 200)


In [47]:
print(x_train_lstm.shape)
print(x_test_lstm.shape)

(6420, 1, 200)
(2140, 1, 200)
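
A short NumPy sketch of what this reshape does, using an arbitrary stand-in array rather than the real TF-IDF matrix. Keras LSTM layers expect input shaped (samples, timesteps, features), so reshaping to (-1, 1, 200) presents each tweet as a sequence with a single time step.

```python
import numpy as np

# Stand-in for a dense TF-IDF matrix: 4 tweets, 200 features (values are arbitrary).
features = np.arange(4 * 200, dtype="float64").reshape(4, 200)

# One time step per tweet, holding the whole 200-dimensional vector.
as_sequences = features.reshape(-1, 1, 200)

print(as_sequences.shape)  # (4, 1, 200)
# The single time step of each sample is exactly the original row.
print(np.array_equal(as_sequences[:, 0, :], features))  # True
```

Because every sequence has length one, the LSTM layers here never see word order; they act on the bag-of-words vector as a whole.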


In [48]:
print("Length of train set is",len(x_train_lstm))
print("Length of label set is",len(y_lstm))

Length of train set is 6420
Length of label set is 6420


In [49]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
import numpy as np

data_dim = 200
#tunable parameter
batch_size = 50
epochs = 5

In [50]:
model = Sequential()
model.add(LSTM(100, input_shape=(None, data_dim),return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(200))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['accuracy'])

In [51]:
lstm_history = model.fit(x_train_lstm[0:5136],y_lstm[0:5136], validation_data = (x_train_lstm[5136:],y_lstm[5136:]), batch_size=batch_size, epochs=epochs)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
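
The slice index 5136 used in the fit call is 80% of the 6420 training rows. A tiny sketch of that arithmetic, with variable names of my own choosing:

```python
n_train = 6420                    # rows in the training set, from the shapes printed earlier
split_at = n_train * 8 // 10      # first 80% of rows used for fitting
n_holdout = n_train - split_at    # last 20% of rows used as validation_data

print(split_at, n_holdout)  # 5136 1284
```

Note that slicing takes the rows in file order. If the file were ordered in some meaningful way, a shuffled split (for example with the `train_test_split` function imported earlier, which shuffles by default) might be a safer choice.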


## Estimation

In [52]:
results = model.evaluate(x_test_lstm, y_test_lstm)



In [None]:
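y_pred_lstm = (model.predict(x_test_lstm).ravel() > 0.5).astype(int) #Threshold the sigmoid outputs to get 0/1 class labels
matrix = metrics.confusion_matrix(y_test_lstm, y_pred_lstm) #Both arrays are 1-D class labels, so no argmax is needed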

In [53]:
from sklearn import metrics

y_pred = (model.predict(x_test_lstm).ravel()>0.5)+0 # predict and get class (0 if pred < 0.5 else 1)
print(metrics.classification_report(test_y, y_pred, target_names=['Fake', 'Real']))

              precision    recall  f1-score   support

        Fake       0.85      0.87      0.86      1020
        Real       0.88      0.86      0.87      1120

    accuracy                           0.87      2140
   macro avg       0.87      0.87      0.87      2140
weighted avg       0.87      0.87      0.87      2140
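
The prediction line above turns sigmoid outputs into class labels by thresholding at 0.5. A minimal NumPy sketch with made-up probabilities shows what `ravel()`, the comparison, and the `+0` each contribute:

```python
import numpy as np

# Made-up sigmoid outputs with the (n, 1) shape that model.predict returns here.
probs = np.array([[0.12], [0.50], [0.93]])

flat = probs.ravel()                    # shape (3,)
labels_a = (flat > 0.5) + 0             # boolean array promoted to integers
labels_b = (flat > 0.5).astype(int)     # the same result, spelled out

print(labels_a)  # [0 0 1]
print(np.array_equal(labels_a, labels_b))  # True
```

A probability of exactly 0.5 falls on the fake side, since the comparison is strict.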



## Save the model

In [54]:
# save the model

import pickle
# now you can save it to a file
with open('./tfidf/tfidf_lstm.pkl', 'wb') as f:
    pickle.dump(model, f)



INFO:tensorflow:Assets written to: ram://ff31196d-6906-462a-910a-1eb9e4774772/assets




## Inference Demo

In [87]:
with open('./tfidf/tfidf.pickle', 'rb') as f:
    saved_tf_idf = pickle.load(f)

with open('./tfidf/tfidf_lstm.pkl', 'rb') as f:
    saved_lstm = pickle.load(f)


test_text_fake = 'Alfalfa is the only cure for COVID-19.'
test_text_real = '#IndiaFightsCorona India has one of the lowest #COVID19 mortality globally with less than 2% Case Fatality Rate. As a result of supervised home isolation &amp; effective clinical treatment many States/UTs have CFR lower than the national average. https://t.co/QLiK8YPP7E'

cleaned = cleanTweets([test_text_fake, test_text_real])
test_tfidf = saved_tf_idf.transform(cleaned).toarray()
x_test = test_tfidf.reshape(-1,1,200)

print(x_test.shape, x_test.dtype)

#0= Fake news
#1= Real news
preds = (saved_lstm.predict(x_test).ravel()>0.5)+0
for res in preds:
  if res==1:
    print("Real Covid News")
  elif res==0:
    print("Fake Covid News")

print(preds)


(2, 1, 200) float64
Fake Covid News
Real Covid News
[0 1]


# CNN

In [30]:
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasClassifier
from keras.layers import Dense, Dropout
from keras.utils import np_utils

In [39]:
X = X_train_tf.astype('float16')
X_test = X_test_tf.astype('float16')

In [31]:
y = twTrain["encodedLabel"].values.tolist()
y_test= twValid["encodedLabel"].values.tolist()

dummy_y_train = np_utils.to_categorical(y)

In [32]:
print(dummy_y_train.shape)
print(X_train_tf.shape)

(6420, 2)
(6420, 200)


In [37]:
# Model Training 
print ("Create model ... ")
def build_model():
    model = Sequential()
    model.add(Dense(256, input_dim=200, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(200, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(160, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(120, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(80, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

Create model ... 


In [40]:
print("Compile model ...")
estimator = KerasClassifier(build_fn=build_model, epochs=15, batch_size=128)
estimator.fit(X[0:5136], dummy_y_train[0:5136], validation_data = (X[5136:], dummy_y_train[5136:]))

Compile model ...
Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_18 (Dense)            (None, 256)               51456     
                                                                 
 dropout_15 (Dropout)        (None, 256)               0         
                                                                 
 dense_19 (Dense)            (None, 200)               51400     
                                                                 
 dropout_16 (Dropout)        (None, 200)               0         
                                                                 
 dense_20 (Dense)            (None, 160)               32160     
                                                                 
 dropout_17 (Dropout)        (None, 160)               0         
                                                                 
 dense_21 (Dense)            (None, 

  
  "shape. This may consume a large amount of memory." % value)


Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f96f231cf50>

## Estimation

In [41]:
# Predictions 
print ("Predict on test data ... ")
y_pred = estimator.predict(X_test)

Predict on test data ... 


In [42]:
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.91      0.87      1020
           1       0.91      0.84      0.87      1120

    accuracy                           0.87      2140
   macro avg       0.87      0.87      0.87      2140
weighted avg       0.88      0.87      0.87      2140



## Save the model

In [43]:
# save the model

import pickle
# now you can save it to a file
with open('./tfidf/tfidf_cnn.pkl', 'wb') as f:
    pickle.dump(estimator, f)

INFO:tensorflow:Assets written to: ram://b9972d99-18a2-48eb-b000-4fb3d1ac788a/assets


## Inference Demo