## ML Challenge WS 2022/23

#### Task:

Your task is to train a clickbait filter that classifies articles as clickbait based on their headlines. You are free to decide how to prepare the data and which ML model to use for classification.

The challenge is considered passed if your model performs better than our baseline (a simple classifier; F1 ~0.89). Report at least the F1 score of your classifier. Your model will be evaluated on a hold-out dataset, so please prepare a script that lets us evaluate your trained model on it.
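Since the pass criterion is stated in terms of F1, a minimal sketch of how the score is computed for the positive (clickbait) class may help. The function name `f1_score_binary` is illustrative and not part of the challenge material; it assumes labels are encoded as 1 for clickbait and 0 for news.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length 0/1 sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive prediction: precision or recall is zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


score = f1_score_binary([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(f"F1 score: {score:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a classifier has to do well on both to beat the baseline.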

#### Dataset:

The data consists of two text files: one with clickbait headlines and one with headlines from news sources. The hold-out dataset is organized the same way.
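The two-file layout can be turned into a labeled dataset with plain Python. The sketch below assumes one headline per line and UTF-8 encoding; the helper names `read_headlines` and `build_dataset` are hypothetical, and the label convention (1 = clickbait, 0 = news) is an assumption.

```python
from pathlib import Path


def read_headlines(path):
    """Return the non-empty lines of a text file, one headline per line."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_dataset(clickbait_path, news_path):
    """Combine both files into parallel lists of headlines and 0/1 labels."""
    clickbait = read_headlines(clickbait_path)
    news = read_headlines(news_path)
    headlines = clickbait + news
    # Assumed convention: 1 = clickbait, 0 = news
    labels = [1] * len(clickbait) + [0] * len(news)
    return headlines, labels
```

A call such as `build_dataset("clickbait_yes", "clickbait_no")` would then return both lists in matching order.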

## Submission by:
    Abhijit Ramsajivan Yadav (1619172)
    Saurabh Mishra (1608460)
    Sohail Ahmed Qadri (1620630)
    

In [1]:
with open('clickbait_yes') as f:
    lines = [line.rstrip() for line in f]

lines[0:10]

['Guys Try Tinder',
 'Michael B. Jordan Got Laid The Fuck Out While Filming "Creed"',
 'What\'s The Most Fucked Up Thing You\'ve Done On "Rollercoaster Tycoon"',
 'How Far Would You Make It In The Hunger Games',
 "If Matthew Gray Gubler's Tweets Were Motivational Posters",
 "Here's What Everyone Wore To The Glamour Women Of The Year Awards",
 'How Many Of These Black Sitcoms Have You Seen',
 '17 Reasons You Should Love Eddie Redmayne',
 "Here's What Lady Gaga's Next Album Should Sound Like, According To Her Little Monsters",
 'Are These People Smiling At A Baby Or A Laptop']

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
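# Import everything from tensorflow.keras so all Keras objects come from the same package
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential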
from sklearn.model_selection import train_test_split

In [3]:
# Load the data (one headline per line). The lines are read directly because
# recent pandas versions reject delimiter='\n' in read_csv.
def read_headlines(path):
    with open(path, encoding="utf-8") as f:
        return pd.DataFrame({"headline": [line.rstrip("\n") for line in f if line.strip()]})

df_data_yes = read_headlines("./clickbait_yes")
df_data_no = read_headlines("./clickbait_no")
# Labels: 1 = clickbait, 0 = news
df_label_yes = pd.DataFrame(np.ones(df_data_yes.shape[0]), columns=['label'])
df_label_no = pd.DataFrame(np.zeros(df_data_no.shape[0]), columns=['label'])
# Stack both classes in the same order for headlines and labels
df_headline = pd.concat([df_data_yes, df_data_no], ignore_index=True)
headline = df_headline.headline.values
df_label = pd.concat([df_label_yes, df_label_no], ignore_index=True)
label = df_label.label.values


In [4]:
#preprocess
tokenizer = Tokenizer()
tokenizer.fit_on_texts(headline)
sequences = tokenizer.texts_to_sequences(headline)
data = pad_sequences(sequences)
print(data)

[[    0     0     0 ...   522   159  1619]
 [    0     0     0 ...   294  4237  6085]
 [    0     0     0 ...     9 12394  6086]
 ...
 [    0     0     0 ...    30  1447   278]
 [    0     0     0 ...  1909     8  1895]
 [    0     0     0 ...     4  2923   307]]
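The leading zeros in the printed matrix come from padding: with its default settings, `pad_sequences` pads at the front with 0 and, when a maximum length is given, also truncates from the front. A minimal pure-Python sketch of that behavior follows; `pad_pre` is an illustrative stand-in, not the Keras implementation.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        # Default: use the length of the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep only the last `maxlen` tokens (truncate from the front)
        kept = list(seq)[-maxlen:] if maxlen > 0 else []
        # Fill the remaining positions at the front with the padding value
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


print(pad_pre([[5, 2], [7, 1, 3, 9]]))
```

Index 0 is never assigned to a word by the tokenizer, which is why it can serve as the padding value.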


#### Questions?

[kuglerk@uni-trier.de](mailto:kuglerk@uni-trier.de?subject=ML%20Challenge%20NLU)

In [5]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, df_label, test_size=0.2)

In [6]:
print(X_train.shape)

(23040, 21)
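The split above is purely random. One possible variation is a stratified, seeded split that keeps the class ratio equal in both parts and gives the same partition on every run. The sketch below uses only the standard library; `stratified_split` is a hypothetical helper, and note that its return order differs from `train_test_split`.

```python
import random


def stratified_split(items, labels, test_size=0.2, seed=0):
    """Split items and labels so each class contributes ~test_size to the test set.

    Returns (X_train, y_train, X_test, y_test).
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_label = {}
    for idx, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(idx)

    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])

    # Shuffle again so the classes are mixed within each part
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)

    def pick(idxs):
        return [items[i] for i in idxs], [labels[i] for i in idxs]

    return pick(train_idx) + pick(test_idx)
```

With scikit-learn, a comparable effect is usually obtained through the `stratify` and `random_state` arguments of `train_test_split`.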


In [7]:
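# Model

model = Sequential()
# Size of the embedding table; it must be larger than the highest token index
# produced by the tokenizer (len(tokenizer.word_index))
vocab_size = 25000
# Length of the padded sequences produced by pad_sequences
max_length = data.shape[1]
model.add(Embedding(input_dim=vocab_size, output_dim=128, input_length=max_length))
model.add(LSTM(units=64))
# A single sigmoid unit outputs the probability that a headline is clickbait
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])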

In [8]:
# Train the model

model.fit(X_train, y_train, epochs=2, batch_size=32, validation_data=(X_test, y_test))


Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x22015200b20>

In [9]:
# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test accuracy:", test_acc)



Test accuracy: 0.9756944179534912


In [10]:
from sklearn.metrics import f1_score
# predict crisp classes for test set
yhat_classes = model.predict(X_test, verbose=0)
yhat_classes = np.around(yhat_classes)
print(yhat_classes)
#f1 calculation
f1 = f1_score(y_test, yhat_classes)
print('F1 score: %f' % f1)

[[0.]
 [1.]
 [1.]
 ...
 [1.]
 [1.]
 [0.]]
F1 score: 0.975593
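The rounding step above turns sigmoid outputs into hard labels. NumPy rounds exact halves to the nearest even value, so a probability of exactly 0.5 becomes 0. An explicit threshold makes the cut-off visible and adjustable; `to_labels` and `threshold` are illustrative names, not part of the submission.

```python
def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs in [0, 1] to hard 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


print(to_labels([0.1, 0.5, 0.93]))
```

Raising the threshold trades recall for precision, which can matter when tuning for F1.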


In [11]:
# Evaluation data
df_eval = pd.read_csv("./clickbait_hold_X.csv", header=None, names=["headline"])
eval_headlines = df_eval.headline.values
# Reuse the tokenizer fitted on the training data; refitting it here would
# change the word indices the embedding layer was trained on
eval_sequences = tokenizer.texts_to_sequences(eval_headlines)
# Pad to the sequence length the model was trained with
eval_data = pad_sequences(eval_sequences, maxlen=max_length)
# Predict the labels
yhat_classes = model.predict(eval_data, verbose=0)
yhat_classes = np.around(yhat_classes)
yhat_classes = yhat_classes.flatten()

# Export the predictions
df_pred = pd.DataFrame(yhat_classes.astype(int))
df_pred.to_csv("predicted_labels for new data.text", index=False, header=False)
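When scoring new headlines, the word-to-index mapping learned during training has to stay fixed, because the embedding layer was trained on those indices. The sketch below illustrates the idea in pure Python with a simplified whitespace tokenizer; `build_vocab` and `encode` are hypothetical helpers and do not reproduce the Keras `Tokenizer` exactly.

```python
from collections import Counter


def build_vocab(train_texts):
    """Assign indices by frequency; index 0 stays reserved for padding."""
    counts = Counter(w for text in train_texts for w in text.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}


def encode(texts, vocab, maxlen):
    """Encode texts with a fixed vocabulary, dropping unseen words."""
    rows = []
    for text in texts:
        ids = [vocab[w] for w in text.lower().split() if w in vocab]
        ids = ids[-maxlen:]  # keep the last maxlen tokens
        rows.append([0] * (maxlen - len(ids)) + ids)  # left-pad with zeros
    return rows


vocab = build_vocab(["you won't believe this", "believe this report"])
# "shocking" was never seen in training, so it is skipped
print(encode(["Believe this shocking report"], vocab, maxlen=5))
```

Words that did not occur in the training data are simply skipped here, so every encoded row stays within the index range the model knows.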