<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---

<p align="center"><h1 align="center">Quick Start: Clickbait Detection Text Classification Tutorial</h1></p>

##### <p align="center">*Dataset adapted from: Abhijnan Chakraborty, Bhargavi Paranjape, Sourya Kakarla, and Niloy Ganguly. "Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media". In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, US, August 2016.*</p>
---

**Instructions:**
1. Get data in and set up X_train, X_test, and y_train
2. Preprocess data using the Keras Tokenizer; write and save a preprocessor function
3. Fit a model on the preprocessed data; save the preprocessor function and the model
4. Generate predictions from X_test data and submit the model to the competition
5. Repeat the submission process to improve your place on the leaderboard



## 1. Get data in and set up X_train, X_test, y_train objects

In [None]:
#install aimodelshare library
! pip install aimodelshare --upgrade

In [2]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/clickbait_competition_data-repository:latest') 


Data downloaded successfully.


In [3]:
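# Set up X_train, X_test, and y_train_labels objects
import pandas as pd

# Each CSV holds a single column; .squeeze("columns") turns it into a Series.
# (The squeeze=True argument of read_csv was deprecated and later removed from pandas.)
X_train = pd.read_csv("clickbait_competition_data/X_train.csv").squeeze("columns")
X_test = pd.read_csv("clickbait_competition_data/X_test.csv").squeeze("columns")

y_train_labels = pd.read_csv("clickbait_competition_data/y_train.csv").squeeze("columns")

# One-hot encode the labels (one column per class)
y_train = pd.get_dummies(y_train_labels)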

X_train.head()

0       MyBook Disk Drive Handles Lots of Easy Backups
1                       CIT Posts Eighth Loss in a Row
2    Candy Carson Singing The "National Anthem" Is ...
3    Why You Need To Stop What You're Doing And Dat...
4    27 Times Adele Proved She's Actually The Reale...
Name: headline, dtype: object
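
To make the one-hot step concrete, here is a minimal pure-Python sketch of the kind of encoding `pd.get_dummies` applies to the labels. The label strings are hypothetical stand-ins (the actual values in `y_train.csv` are not shown in this tutorial), `one_hot` is an illustrative helper rather than a pandas function, and placing the columns in sorted label order is an assumption of the sketch.

```python
def one_hot(labels):
    """Return (columns, rows): the distinct labels and one 0/1 row per input label."""
    # Assumption: columns follow sorted label order.
    columns = sorted(set(labels))
    position = {label: i for i, label in enumerate(columns)}
    rows = [
        [1 if position[label] == j else 0 for j in range(len(columns))]
        for label in labels
    ]
    return columns, rows


# Hypothetical label values, used only for illustration.
example_labels = ["clickbait", "not clickbait", "clickbait"]
columns, rows = one_hot(example_labels)
print(columns)  # ['clickbait', 'not clickbait']
print(rows)     # [[1, 0], [0, 1], [1, 0]]
```

The column order matters later in the tutorial: predicted column indices are mapped back to label names through `y_train.columns`.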

## 2. Preprocess data using keras tokenizer / Write and Save Preprocessor function


In [4]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

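# The preprocessor converts each headline to a sequence of word indices and
# pads/truncates it so that all documents have the same length (maxlen).
def preprocessor(data, maxlen=40, max_words=10000):
    # Note: max_words is not used inside the function; the vocabulary size is
    # fixed by num_words=10000 on the tokenizer fitted above.
    sequences = tokenizer.texts_to_sequences(data)

    X = pad_sequences(sequences, maxlen=maxlen)

    return X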

In [5]:
# Check the shape of the data preprocessed with your new preprocessor() function
print(preprocessor(X_train, maxlen=40, max_words=10000).shape)
print(preprocessor(X_test, maxlen=40, max_words=10000).shape)

(24979, 40)
(6245, 40)
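
The following pure-Python sketch illustrates, in simplified form, what the preprocessor does: rank words by frequency, replace each word with its rank, drop words outside the vocabulary limit, then left-pad or front-truncate to a fixed length. It is an approximation for intuition only, not the Keras implementation. The helper names, the punctuation set, and the two example headlines are all assumptions made for this sketch.

```python
from collections import Counter

# Assumption: punctuation to strip before splitting (apostrophes are kept).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TO_SPACE = str.maketrans({ch: " " for ch in FILTERS})


def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return text.lower().translate(_TO_SPACE).split()


def build_word_index(texts):
    """Most frequent word gets index 1; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in tokenize(text))
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in tokenize(t) if word_index.get(w, num_words) < num_words]
        for t in texts
    ]


def pad(sequences, maxlen):
    """Keep the last maxlen items and left-pad shorter sequences with zeros."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


# Hypothetical headlines, used only for illustration.
headlines = [
    "Stop what you are doing",
    "You will not believe what happened next to you",
]
word_index = build_word_index(headlines)
sequences = texts_to_sequences(headlines, word_index, num_words=10)
print(pad(sequences, maxlen=6))
# [[0, 3, 2, 1, 4, 5], [6, 7, 8, 2, 9, 1]]
```

In the second headline, "next" and "to" fall outside the ten-index vocabulary and are dropped, and the leading "you" is cut off by front truncation.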


## 3. Fit model on preprocessed data and save preprocessor function and model


In [6]:
from tensorflow.keras.layers import Dense, Embedding,Flatten
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(10000, 16, input_length=40))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
model.summary()

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(preprocessor(X_train), y_train,
                    epochs=1,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 40, 16)            160000    
                                                                 
 flatten (Flatten)           (None, 640)               0         
                                                                 
 dense (Dense)               (None, 2)                 1282      
                                                                 
Total params: 161,282
Trainable params: 161,282
Non-trainable params: 0
_________________________________________________________________
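
The parameter counts in the summary above can be reproduced by hand. The short sketch below does the arithmetic for this architecture; the helper function names are illustrative and not part of Keras.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index.
    return vocab_size * embed_dim


def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units


vocab_size, embed_dim, maxlen, n_classes = 10000, 16, 40, 2

embedding = embedding_params(vocab_size, embed_dim)  # 160000
flattened_width = maxlen * embed_dim                 # 640, Flatten has no weights
dense = dense_params(flattened_width, n_classes)     # 1282
total = embedding + dense                            # 161282

print(embedding, flattened_width, dense, total)
```

Almost all of the parameters sit in the embedding layer, so changing the vocabulary size or embedding width changes the model size far more than changing the classifier head.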


#### Save preprocessor function to local "preprocessor.zip" file

In [7]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


#### Save model to local ".onnx" file

In [8]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

## 4. Generate predictions from X_test data and submit model to competition


In [9]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://wc9yefhdca.execute-api.us-east-1.amazonaws.com/prod/m" # Unique REST API URL that powers this Clickbait Identification Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [10]:
#Instantiate Competition

mycompetition= ai.Competition(apiurl)

In [11]:
#Submit Model 1: 

#-- Generate predicted y values (Model 1)
#Note: Keras predict returns one probability per class; argmax(axis=1) picks the column index of the most likely class
prediction_column_index=model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 3

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1309
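
The prediction step in the submission cell boils down to two operations: take the index of the largest probability in each row, then look that index up in the one-hot column names. A minimal pure-Python sketch follows. The probabilities and label names are hypothetical, and `argmax` here is a small stand-in for the NumPy method used in the cell.

```python
def argmax(row):
    """Index of the largest value in row (first index wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical model output: one row per headline, one probability per class.
probabilities = [
    [0.91, 0.09],
    [0.20, 0.80],
    [0.55, 0.45],
]
# Hypothetical stand-in for y_train.columns.
label_columns = ["clickbait", "not clickbait"]

prediction_column_index = [argmax(row) for row in probabilities]
prediction_labels = [label_columns[i] for i in prediction_column_index]

print(prediction_column_index)  # [0, 1, 0]
print(prediction_labels)        # ['clickbait', 'not clickbait', 'clickbait']
```

Submitting label names rather than column indices keeps the predictions independent of the column order used during training.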


In [12]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

| | accuracy | f1_score | precision | recall | ml_framework | transfer_learning | deep_learning | model_type | depth | num_params | lstm_layers | embedding_layers | flatten_layers | dense_layers | bidirectional_layers | softmax_act | tanh_act | loss | optimizer | model_config | memory_size | username | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 96.89% | 96.89% | 96.88% | 96.90% | keras | False | True | Sequential | 5 | 174658 | 2.0 | 1 | 1.0 | 1 | | 1 | 2.0 | function | RMSprop | {'name': 'sequential', 'layers... | 2233624 | AIModelShare | 1 |
| 2 | 96.06% | 96.05% | 96.12% | 96.02% | keras | False | True | Sequential | 3 | 161282 | | 1 | 1.0 | 1 | | 1 | | str | RMSprop | {'name': 'sequential', 'layers... | 1412632 | AIModelShare | 3 |
| 3 | 78.46% | 78.38% | 79.41% | 78.74% | keras | False | True | Sequential | 5 | 191362 | 2.0 | 1 | | 1 | 1.0 | 1 | 2.0 | function | RMSprop | {'name': 'sequential_11', 'lay... | 3354488 | AIModelShare | 2 |
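
The leaderboard reports accuracy, F1, precision, and recall. If you hold out some labeled data locally, you can estimate the same kinds of metrics before submitting. The sketch below computes accuracy plus macro-averaged precision, recall, and F1 in pure Python. Whether the platform uses macro averaging is an assumption here, and the labels and predictions are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 over all labels."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1_score": sum(f1s) / n,
    }


# Hypothetical held-out labels and predictions.
y_true = ["clickbait", "clickbait", "not clickbait", "not clickbait"]
y_pred = ["clickbait", "not clickbait", "not clickbait", "not clickbait"]
scores = macro_scores(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
```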


## 5. Repeat submission process to improve place on leaderboard


In [13]:
# Train and submit model 2 using same preprocessor (note that you could save a new preprocessor, but we will use the same one for this example).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten

model2 = Sequential()
model2.add(Embedding(10000, 16, input_length=40))
model2.add(LSTM(32, return_sequences=True, dropout=0.2))
model2.add(LSTM(32, dropout=0.2))
model2.add(Flatten())
model2.add(Dense(2, activation='softmax'))

model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model2.fit(preprocessor(X_train), y_train,
                    epochs=1,
                    batch_size=32,
                    validation_split=0.2)



In [14]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [15]:
#Submit Model 2: 

#-- Generate predicted y values (Model 2)
prediction_column_index=model2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 4

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1309


In [17]:
# Compare two or more models 
data=mycompetition.compare_models([3, 4], verbose=1)
mycompetition.stylize_compare(data)

| | Model_3_Layer | Model_3_Shape | Model_3_Params | Model_4_Layer | Model_4_Shape | Model_4_Params |
|---|---|---|---|---|---|---|
| 0 | Embedding | [None, 40, 16] | 160000.0 | Embedding | [None, 40, 16] | 160000 |
| 1 | Flatten | [None, 640] | 0.0 | LSTM | [None, 40, 32] | 6272 |
| 2 | Dense | [None, 2] | 1282.0 | LSTM | [None, 32] | 8320 |
| 3 | | | | Flatten | [None, 32] | 0 |
| 4 | | | | Dense | [None, 2] | 66 |


## Optional: Tune model within range of hyperparameters with Keras Tuner

*Simple example shown below. Consult [documentation](https://keras.io/guides/keras_tuner/getting_started/) to see full functionality.*

In [None]:
! pip install keras_tuner

In [19]:
#Separate validation data 
from sklearn.model_selection import train_test_split
x_train_split, x_val, y_train_split, y_val = train_test_split(
     X_train, y_train, test_size=0.2, random_state=42)
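
For intuition about the split above, the sketch below builds a shuffled 80/20 holdout over row indices using only the standard library. It does not reproduce the exact rows that `train_test_split` selects for `random_state=42`. Rounding the validation size up is an assumption of this sketch, and `holdout_split` is a hypothetical helper.

```python
import math
import random


def holdout_split(n_samples, test_size=0.2, seed=42):
    """Return (train_indices, val_indices) from a seeded shuffle."""
    n_val = math.ceil(test_size * n_samples)  # assumption: round up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_val:], indices[:n_val]


# 24979 is the number of training headlines reported earlier.
train_idx, val_idx = holdout_split(24979)
print(len(train_idx), len(val_idx))  # 19983 4996
```

Because the split is made on the raw headlines, the same `preprocessor` function is applied to both parts afterwards.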

In [20]:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten
import keras_tuner as kt

#Define model structure & parameter search space with function
def build_model(hp):
    model = keras.Sequential()
    model.add(Embedding(10000, 16, input_length=40))
    model.add(LSTM(units=hp.Int("units", min_value=32, max_value=512, step=32), #range 32-512 inclusive, minimum step between tested values is 32
                   return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
    model.add(Flatten())
    model.add(Dense(2, activation='softmax'))
    model.compile(
        optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"],
    )
    return model

#initialize the tuner (which will search through parameters)
tuner = kt.RandomSearch(
    hypermodel=build_model, 
    objective="val_accuracy", # objective to optimize
    max_trials=3, #max number of trials to run during search
    executions_per_trial=1, #higher number reduces variance of results; gauges model performance more accurately
    overwrite=True,
    directory="tuning_model",
    project_name="tuning_units",
)

tuner.search(preprocessor(x_train_split), y_train_split, epochs=1, validation_data=(preprocessor(x_val), y_val))


Trial 3 Complete [00h 02m 25s]
val_accuracy: 0.9423539042472839

Best val_accuracy So Far: 0.9659727811813354
Total elapsed time: 00h 14m 21s


In [21]:
# Build model with best hyperparameters

# Get the best hyperparameter sets found (up to 5, best first).
best_hps = tuner.get_best_hyperparameters(5)
# Build the model with the best set.
tuned_model = build_model(best_hps[0])
# Fit on the entire training set.
tuned_model.fit(x=preprocessor(X_train), y=y_train, epochs=1)




<keras.callbacks.History at 0x7f3ec0486ed0>

In [22]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(tuned_model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("tuned_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [23]:
#Submit Model 3: 

#-- Generate predicted y values (Model 3)
prediction_column_index=tuned_model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 3 (tuned model) to Competition Leaderboard
mycompetition.submit_model(model_filepath = "tuned_model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 5

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1309


In [24]:
# Get leaderboard

data = mycompetition.get_leaderboard()
mycompetition.stylize_leaderboard(data)

| | accuracy | f1_score | precision | recall | ml_framework | transfer_learning | deep_learning | model_type | depth | num_params | lstm_layers | embedding_layers | flatten_layers | dense_layers | bidirectional_layers | softmax_act | tanh_act | loss | optimizer | model_config | memory_size | username | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 97.07% | 97.06% | 97.12% | 97.03% | keras | False | True | Sequential | 4 | 244482 | 1.0 | 1 | 1.0 | 1 | | 1 | 1.0 | str | RMSprop | {'name': 'sequential_1', 'laye... | 8573008 | AIModelShare | 5 |
| 2 | 96.89% | 96.89% | 96.88% | 96.90% | keras | False | True | Sequential | 5 | 174658 | 2.0 | 1 | 1.0 | 1 | | 1 | 2.0 | function | RMSprop | {'name': 'sequential', 'layers... | 2233624 | AIModelShare | 1 |
| 3 | 96.56% | 96.55% | 96.54% | 96.57% | keras | False | True | Sequential | 5 | 174658 | 2.0 | 1 | 1.0 | 1 | | 1 | 2.0 | str | RMSprop | {'name': 'sequential_1', 'laye... | 13583792 | AIModelShare | 4 |
| 4 | 96.06% | 96.05% | 96.12% | 96.02% | keras | False | True | Sequential | 3 | 161282 | | 1 | 1.0 | 1 | | 1 | | str | RMSprop | {'name': 'sequential', 'layers... | 1412632 | AIModelShare | 3 |
| 5 | 78.46% | 78.38% | 79.41% | 78.74% | keras | False | True | Sequential | 5 | 191362 | 2.0 | 1 | | 1 | 1.0 | 1 | 2.0 | function | RMSprop | {'name': 'sequential_11', 'lay... | 3354488 | AIModelShare | 2 |


In [25]:
# Compare two or more models
data=mycompetition.compare_models([4,5], verbose=1)
mycompetition.stylize_compare(data)

| | Model_4_Layer | Model_4_Shape | Model_4_Params | Model_5_Layer | Model_5_Shape | Model_5_Params |
|---|---|---|---|---|---|---|
| 0 | Embedding | [None, 40, 16] | 160000 | Embedding | [None, 40, 16] | 160000.0 |
| 1 | LSTM | [None, 40, 32] | 6272 | LSTM | [None, 40, 128] | 74240.0 |
| 2 | LSTM | [None, 32] | 8320 | Flatten | [None, 5120] | 0.0 |
| 3 | Flatten | [None, 32] | 0 | Dense | [None, 2] | 10242.0 |
| 4 | Dense | [None, 2] | 66 | | | |
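
The comparison also reveals which `units` value the tuner selected for Model 5. As a cross-check, the sketch below searches the tuning range for the value whose LSTM parameter count matches the 74240 reported above, then rebuilds the model's total. The formula assumes a standard LSTM layer with biases, and the helper name is illustrative.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and biases.
    return 4 * (units * (input_dim + units) + units)


search_space = range(32, 512 + 1, 32)  # the hp.Int range used for "units"
observed_lstm_params = 74240           # Model 5 LSTM row in the comparison

matching_units = [u for u in search_space if lstm_params(16, u) == observed_lstm_params]
print(matching_units)  # [128]

units = matching_units[0]
flattened_width = 40 * units             # return_sequences=True keeps all 40 steps
dense = flattened_width * 2 + 2          # 10242
total = 10000 * 16 + observed_lstm_params + dense
print(flattened_width, dense, total)     # 5120 10242 244482
```

The total matches the `num_params` value listed for version 5 on the leaderboard, and the flattened width of 5120 explains why this model's dense layer is so much larger than Model 4's.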
