LorenzoNorcini/Clickbait-Detector

Clickbait-Detector

Detects clickbait headlines using an SVM classifier.

Requirements

Usage

(Optional) Register an application on Reddit and add your credentials to dataset_builder.py:

import praw

# Replace each '*' with the values from your registered Reddit application.
reddit = praw.Reddit(client_id='*',
                     client_secret='*',
                     password='*',
                     user_agent='*',
                     username='*')

First run

python train.py

This loads the dataset (the NTRD file), trains the classifier, and saves it.
To download recent titles from Reddit, delete the NTRD file and then re-run train.py.
(NOTE: this discards the titles previously obtained from Reddit, since there is no check for duplicates.)
You can then call the predict.py script, passing the headline string as an argument.

python predict.py "this is a test headline"

Data

The dataset is the one built by saurabhmathur96 (https://github.com/saurabhmathur96),
plus titles collected from the subreddits r/news, r/inthenews and r/savedyouaclick.

Implementation Details

The dataset is preprocessed with the following operations:

  • tokenization
  • lemmatization
  • stopword removal
  • removal of words shorter than 2 characters
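
The steps above can be sketched roughly as follows. This is a minimal illustration and not the repository's code: the function name preprocess, the tiny STOPWORDS set and the LEMMAS lookup table are all hypothetical stand-ins, and a real run would rely on a full stopword list and a dictionary-based lemmatizer from an NLP library.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption, not the project's list).
STOPWORDS = {"a", "an", "the", "is", "are", "this", "to", "of", "and", "in"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"tricks": "trick", "doctors": "doctor", "headlines": "headline"}


def preprocess(headline):
    """Tokenize, lemmatize, drop stopwords, drop tokens shorter than 2 characters."""
    # Tokenizing: lowercase, then keep runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", headline.lower())
    # Lemmatizing: map each token to its base form when the table knows it.
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Remove words shorter than 2 characters.
    return [t for t in tokens if len(t) >= 2]


if __name__ == "__main__":
    print(preprocess("10 Tricks Doctors Hate: This Is a Test!"))
```

The order of the filters mirrors the list above; swapping in a different lemmatizer or stopword list only changes the two module-level tables.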

A bag-of-words representation is used, with features comprising n-grams of up to 3 words.
The value of each feature is computed using term frequency–inverse document frequency (tf-idf).
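
To make the feature construction concrete, here is a small pure-Python sketch of n-gram extraction (up to length 3) followed by tf-idf weighting. The function names ngrams and tfidf are hypothetical, and the weighting used here (raw count times ln(N/df) + 1, with no smoothing or normalization) is one common variant chosen for illustration; the project most likely relies on a library vectorizer whose exact formula may differ.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return all contiguous n-grams of length 1..max_n as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs, max_n=3):
    """Map each tokenized document to a {ngram: weight} dictionary.

    Assumed weighting: tf is the raw count in the document and
    idf = ln(N / df) + 1, where N is the number of documents and
    df is the number of documents containing the n-gram.
    """
    counts = [Counter(ngrams(doc, max_n)) for doc in docs]
    # Document frequency: each n-gram is counted once per document.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    return [{gram: tf * (math.log(n_docs / df[gram]) + 1.0)
             for gram, tf in c.items()}
            for c in counts]


if __name__ == "__main__":
    docs = [["you", "won't", "believe"], ["you", "should", "know"]]
    for weights in tfidf(docs):
        print(weights)
```

N-grams shared by every document (such as "you" in the toy input) receive the minimum weight, while n-grams unique to one headline are weighted higher.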

Results

Train size: 12336
Validation size: 1449
Test size: 723

            Train   Validation   Test
Accuracy    0.99    0.88         0.90
F1 Score    0.99    0.87         0.89
Recall      0.99    0.90         0.91
Precision   0.99    0.85         0.87
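
For reference, the four metrics reported above can be computed from true and predicted labels as in the sketch below. The function name binary_metrics and the toy labels are hypothetical and serve only to illustrate the definitions; they are unrelated to the figures in the table, and label 1 is assumed to mean "clickbait".

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only (1 = clickbait, 0 = not clickbait).
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0]
    print(binary_metrics(y_true, y_pred))
```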
