# #3|Clickbait Headline Classification
---
### Clickbait

Clickbait is hyperlink text or a thumbnail link designed to attract attention and entice users to follow the link and read, view, or listen to the linked online content. Its defining characteristic is deception: the headline is typically sensationalized or misleading, which makes clickbait a form of false advertising.

<img src="https://www.newswire.com/blog/wp-content/uploads/2019/07/Highres_Adelaine-Emoji-Clickbait.gif" height=300>


Today we will work with a dataset of headlines, each labeled as clickbait or not.

## Let's Get Started.
----

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

In [None]:
df = pd.read_csv('../input/clickbait-dataset/clickbait_data.csv')
df

In [None]:
df.info()

Here we have about 32000 headlines, each already labeled as clickbait or not (1 = clickbait, 0 = not clickbait).

We will also use additional Kaggle data to build a more precise and stable classifier.
Other datasets:
### [Fake News Data](https://www.kaggle.com/antmarakis/fake-news-data)


In [None]:
# Load the additional fake-news headlines
fake_news_df = pd.read_csv('../input/fake-news-data/fnn_politics_fake.csv')
fake_news_df.head()

We have quite a few columns here, and we won't need all of them, so let's drop some.

In [None]:
fake_news_df = fake_news_df.drop(['id', 'news_url', 'tweet_ids'], axis=1).rename(columns={'title': 'headline'})
fake_news_df['clickbait'] = 1
fake_news_df.head()

Now we have more clickbait headlines. Let's add some real headlines too.

In [None]:
real_news_df = pd.read_csv('../input/fake-news-data/fnn_politics_real.csv')
real_news_df = real_news_df.drop(['id', 'news_url', 'tweet_ids'], axis=1).rename(columns={'title': 'headline'})
real_news_df['clickbait'] = 0
real_news_df.head()

Now we have more non-clickbait headlines.

Let's put all the data together and check whether we have duplicates and enough data.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
df = pd.concat([df, real_news_df, fake_news_df], ignore_index=True)
df

We have added 1056 rows to our data. Let's check for duplicate rows and remove them if present.

In [None]:
df = df.drop_duplicates(keep=False, ignore_index=True)  # assign the result; keep=False drops every copy of a duplicated row
df

The number of rows decreased to 32944 after dropping 112 duplicated rows.
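
As a side note on what `keep=False` means: it removes every copy of a duplicated row, whereas `keep='first'` retains one copy of each. The toy frame below is made up for illustration (it is not taken from the dataset) and is only a minimal sketch of the difference.

```python
import pandas as pd

# Hypothetical toy data: the first and third rows are exact duplicates
toy = pd.DataFrame({
    "headline": ["A", "B", "A", "C"],
    "clickbait": [1, 0, 1, 0],
})

# keep=False: every row that has a duplicate is removed, including the first copy
drop_all = toy.drop_duplicates(keep=False, ignore_index=True)

# keep='first': one copy of each duplicated row survives
keep_first = toy.drop_duplicates(keep="first", ignore_index=True)

print(len(toy), len(drop_all), len(keep_first))
```

Which variant is preferable depends on whether a repeated headline should still count once as a training example.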

#### Let's check for number of clickbait headlines and real headlines.

Let's add another column with human-readable values, 'Clickbait' or 'Real_News', as our categorical feature.

In [None]:
df['category'] = df['clickbait']
df['category'] = df['category'].map({1: 'Clickbait', 0: 'Real_News'})

In [None]:
counts = df.clickbait.value_counts()
print('''
Number of Clickbait headlines in data: {}
Number of real Headlines in data: {}
'''.format(
    counts[1],
    counts[0]))

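# value_counts() orders by frequency, so pick the values explicitly to match the labels
fig = go.Figure(data=[go.Pie(labels=['Real_News', 'Clickbait'], values=[counts[0], counts[1]], hole=.3)])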
fig.show()

In [None]:
print('Difference: ',counts[0]-counts[1])

In [None]:
text = df['headline'].values
labels = df['clickbait'].values
text_train, text_test, y_train, y_test = train_test_split(text, labels, test_size=0.2)
print(text_train.shape, text_test.shape, y_train.shape, y_test.shape)

In [None]:
vocab_size = 5000
maxlen = 500
embedding_size = 32

tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(text_train)  # build the vocabulary from training headlines only, so test words do not leak in

X_train = tokenizer.texts_to_sequences(text_train)
x_test = tokenizer.texts_to_sequences(text_test)

X_train = pad_sequences(X_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
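
To make the tokenize-and-pad step more concrete, here is a rough plain-Python sketch of the idea behind `Tokenizer` and `pad_sequences`. It is a simplification, not the Keras implementation: it only lowercases and splits on whitespace (Keras also strips punctuation), and the helper names `build_word_index`, `to_sequences`, and `pad_pre` are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, most frequent word first, starting at 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words by their index; drop unknown words and words ranked >= num_words."""
    return [
        [word_index[w] for w in t.lower().split()
         if w in word_index and word_index[w] < num_words]
        for t in texts
    ]

def pad_pre(seqs, maxlen):
    """Left-pad with zeros (and keep only the last maxlen tokens), assuming maxlen >= 1."""
    padded = []
    for s in seqs:
        s = s[-maxlen:]
        padded.append([0] * (maxlen - len(s)) + s)
    return padded

# Made-up headlines, for illustration only
toy_texts = ["you won't believe this", "believe this report"]
word_index = build_word_index(toy_texts)
sequences = to_sequences(toy_texts, word_index, num_words=5)
padded = pad_pre(sequences, maxlen=6)
print(word_index)
print(padded)
```

Index 0 is reserved for padding, which is why the word indices start at 1.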

In [None]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=maxlen))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))  # single sigmoid unit outputs the clickbait probability
model.summary()

In [None]:
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5, min_delta=1e-4)
ckpt = ModelCheckpoint(filepath='model.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')
rlp = ReduceLROnPlateau(monitor='val_loss', patience=2, factor=0.2)

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train, y_train, batch_size=512, validation_data=(x_test, y_test), epochs=1000, callbacks=[es, ckpt, rlp])

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
x = range(1, len(acc) + 1)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(x, acc, 'b', label='Training acc')
plt.plot(x, val_acc, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(x, loss, 'b', label='Training loss')
plt.plot(x, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

In [None]:
preds = [round(i[0]) for i in model.predict(x_test)]
cm = confusion_matrix(y_test, preds)
plt.figure()
plot_confusion_matrix(cm, figsize=(12,8), hide_ticks=True, cmap=plt.cm.Blues)
plt.xticks(range(2), ['Not clickbait', 'Clickbait'], fontsize=16)
plt.yticks(range(2), ['Not clickbait', 'Clickbait'], fontsize=16)
plt.show()

In [None]:
tn, fp, fn, tp = cm.ravel()

precision = tp/(tp+fp)
recall = tp/(tp+fn)

print("Recall of the model is {:.2f}".format(recall))
print("Precision of the model is {:.2f}".format(precision))
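
Precision and recall are often summarized by a single F1 score, their harmonic mean. The sketch below computes all three from the four confusion-matrix cells. The counts are made up for illustration and are not results from this model, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tn, fp, fn, tp):
    """Compute precision, recall and F1 from confusion-matrix counts.

    tn is unused by these three metrics; it is accepted so the result of
    cm.ravel() can be unpacked straight into the call.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only
p, r, f1 = precision_recall_f1(tn=50, fp=10, fn=5, tp=35)
print("Precision {:.2f}, recall {:.2f}, F1 {:.2f}".format(p, r, f1))
```

With the notebook's own matrix, the same function could be called as `precision_recall_f1(*cm.ravel())`.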

In [None]:
test = ['My biggest laugh reveal ever!', 'Learning game development with Unity', 'A tour of Japan\'s Kansai region', '12 things NOT to do in Europe']
token_text = pad_sequences(tokenizer.texts_to_sequences(test), maxlen=maxlen)
preds = [round(i[0]) for i in model.predict(token_text)]
# Use a fresh loop variable so the earlier `text` array is not overwritten
for headline, pred in zip(test, preds):
    label = 'Clickbait' if pred == 1 else 'Not Clickbait'
    print("{} - {}".format(headline, label))