# Introduction


### Over the past couple of years, the number of news outlets and of users who consume news online has skyrocketed. This has led websites to compete with each other for users' attention, and one common way of doing so is "clickbait". Clickbait, a form of false advertisement, uses hyperlink text or a thumbnail link that is designed to attract attention and to entice users to follow that link and read, view, or listen to the linked piece of online content, with a defining characteristic of being deceptive, typically sensationalized or misleading. A "teaser" aims to exploit the "curiosity gap", providing just enough information to make readers of news websites curious, but not enough to satisfy their curiosity without clicking through to the linked content. (Wikipedia)

### In this notebook we're going to analyze Turkish news article titles and build a classifier that distinguishes actual news from what is likely to be clickbait. For this we will use the "SadedeGel" library, which is built on a Turkish news corpus gathered from various sources.

## Sadedegel:

![logo](https://sadedegel.ai/dist/img/logo-2.png)

###  [SadedeGel](https://github.com/GlobalMaksimum/sadedegel) is an open-source library developed during the NLP OpenHack organized by [Turkey Open Source Platform](https://www.turkiyeacikkaynakplatformu.com/). The library and its ecosystem were awarded 2nd prize in the hackathon and have been in development ever since. 

### "Sadede Gel" means "Cut to the chase" 🏃‍♂️. The main idea was to perform extractive summarization of news through a Chrome extension; however, during development it grew into a utility ecosystem with building blocks, datasets, annotation tools, and various tokenizers and summarizers for the task of extractive summarization. 

### Going forward, the project aims to accommodate other NLP tasks in Turkish. SadedeGel's building blocks are currently mature enough to consume and process Turkish news documents and to output processed data for downstream tasks. Processed outputs may be sentences separated by an ML-based SBD (sentence boundary detector), tokens produced either by the Transformers BERT-TR tokenizer or by a rule-based tokenizer, TF-IDF vectors based on a vocabulary built from extensive Turkish news data, or BERT embeddings from a Turkish BERT model. Enhancements with Word2Vec, Doc2Vec, FastText and ELECTRA Turkish are planned for following releases. 

### You can also check out the project's [MadeWithML page](https://madewithml.com/projects/2048/sadedegel-an-extraction-based-turkish-news-summarizer/)

# Getting Things Ready and Loading the Data

##### Here we install SadedeGel with pip, pretty easy!

In [None]:
# installing sadedegel package from pip

!pip install sadedegel

#### We load the usual tools for NLP tasks, along with SadedeGel's building blocks.

In [None]:
# some basic tools

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# loading sadedegel packages

from sadedegel import Doc, Token
from sadedegel.bblock.word_tokenizer_helper import puncts
from sadedegel.bblock.util import tr_lower


# some extra nlp packages

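import nltk
# the stop word corpus has to be available locally; this is a no-op if it already is
nltk.download('stopwords', quiet=True)
stop_word_list = nltk.corpus.stopwords.words('turkish')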
from collections import Counter, defaultdict
from nltk.probability import FreqDist
from wordcloud import WordCloud

# general-purpose utilities

import random
import time
import itertools

# plotting helpers and style

from matplotlib.ticker import MaxNLocator
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches
plt.style.use('fivethirtyeight')

# reproducibility seed and warning settings

seed=42

import warnings
warnings.filterwarnings('ignore') 

In [None]:
df = pd.read_csv('../input/turkishnewstitle20000clickbaitclassified/20000_turkish_news_title.csv')

#### Our data seems simple enough:

- id: identifies a specific observation row.
- clickbait: 1 for clickbait, 0 for actual news titles.
- site: Source of the news title.
- title: Actual news title from various news sites.

In [None]:
df.head()

In [None]:
df.shape

#### For this task we're not going to judge which site publishes more clickbait, so we keep only the 'title' and 'clickbait' columns. They are enough for our classification task. Some rows are missing the target value, so we should drop them too...

In [None]:
df = df[['title','clickbait']]

In [None]:
display(df.isna().sum())

In [None]:
# getting rid of nan rows (reassigning avoids modifying a slice of the original frame in place)

df = df.dropna()

# Meta Features

#### In this part we'll analyse some basic meta features of our data, such as the target distribution and the character/word counts per title. Pretty simple stuff, but it can give us some insights...

## Target

#### Our target distribution looks nicely balanced, which is good for classification tasks.

In [None]:
# Displaying target distribution.

fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(18, 6), dpi=100)
sns.countplot(x=df['clickbait'], ax=axes[0])
# sort_index() keeps the counts in 0, 1 order so they line up with the labels below
axes[1].pie(df['clickbait'].value_counts().sort_index(),
            labels=['No Bait', 'Clickbait'],
            autopct='%1.2f%%',
            shadow=True,
            explode=(0.05, 0),
            startangle=60)
fig.suptitle('Distribution of the Target', fontsize=24)
plt.show()

## Character Counts

### Here we observe:

- Actual news titles are much longer than clickbait titles.
- Actual news titles usually have more than 100 characters.
- Meanwhile, clickbait titles have a median of around 50 characters.

In [None]:
# Creating a new feature for the visualization.

df['Character Count'] = df['title'].apply(lambda x: len(str(x)))


def plot_dist3(df, feature, title):
    # Creating a customized chart. and giving in figsize and everything.
    fig = plt.figure(constrained_layout=True, figsize=(18, 8))
    # Creating a grid of 3 cols and 3 rows.
    grid = gridspec.GridSpec(ncols=3, nrows=3, figure=fig)

    # Customizing the histogram grid.
    ax1 = fig.add_subplot(grid[0, :2])
    # Set the title.
    ax1.set_title('Histogram')
    # plot the histogram.
    sns.distplot(df.loc[:, feature],
                 hist=True,
                 kde=True,
                 ax=ax1,
                 color='#e74c3c')
    ax1.set(ylabel='Frequency')
    ax1.xaxis.set_major_locator(MaxNLocator(nbins=20))

    # Customizing the ecdf_plot.
    ax2 = fig.add_subplot(grid[1, :2])
    # Set the title.
    ax2.set_title('Empirical CDF')
    # Plotting the ecdf_Plot.
    sns.distplot(df.loc[:, feature],
                 ax=ax2,
                 kde_kws={'cumulative': True},
                 hist_kws={'cumulative': True},
                 color='#e74c3c')
    ax2.xaxis.set_major_locator(MaxNLocator(nbins=20))
    ax2.set(ylabel='Cumulative Probability')

    # Customizing the Box Plot.
    ax3 = fig.add_subplot(grid[:, 2])
    # Set title.
    ax3.set_title('Box Plot')
    # Plotting the box plot.
    sns.boxplot(y=feature, data=df, ax=ax3, color='#e74c3c')  # y= draws a vertical box
    ax3.yaxis.set_major_locator(MaxNLocator(nbins=25))

    plt.suptitle(f'{title}', fontsize=24)

In [None]:
plot_dist3(df[df['clickbait'] == 0], 'Character Count',
           'Characters Per "Non Bait" Title')

In [None]:
plot_dist3(df[df['clickbait'] == 1], 'Character Count',
           'Characters Per "Clickbait" Title')
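#### To put numbers next to the plots above, one option is to compare the median title length per class. The sketch below is only an illustration: it uses a handful of titles with labels assigned by hand for the example (they are not rows from the dataset), and `median_lengths` is a hypothetical helper, not part of any library. On the real data the same idea would be a group-by on `clickbait`.

```python
from statistics import median

# Hand-labelled toy data for illustration only; 1 = clickbait, 0 = actual news.
toy_titles = [
    ("Son hali yürek burkuyor", 1),
    ("Öyle bir değişim geçirdi ki", 1),
    ("Trump'ın vergi kayıtları Çin'le iş bağlantılarını gösteriyor", 0),
    ("Türkiye'nin en yüksek barajında yüzde 87'lik fiziki gerçekleşme sağlandı", 0),
]


def median_lengths(pairs):
    """Return the median character and word count of the titles in each class."""
    grouped = {}
    for title, label in pairs:
        grouped.setdefault(label, []).append(title)
    return {
        label: {
            "chars": median(len(t) for t in titles),
            "words": median(len(t.split()) for t in titles),
        }
        for label, titles in grouped.items()
    }


summary = median_lengths(toy_titles)
print(summary)
```

#### The median is used rather than the mean because a few very long titles would otherwise pull the summary upwards.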

## Word Counts

### Here we observe:

- The picture is pretty similar to the character counts we saw before.
- Bait titles have far fewer words.
- Meanwhile, actual news titles are much richer in terms of word count.

In [None]:
def plot_word_number_histogram(textno, textye):
    
    """A function for comparing word counts"""

    fig, axes = plt.subplots(figsize=(18, 6), sharey=True)
    sns.kdeplot(textno.str.split().map(len),shade=True,color='#e74c3c')
    sns.kdeplot(textye.str.split().map(len),shade=True)
    
    plt.xlabel('Word Count')
    plt.ylabel('Density')  # a KDE plot shows an estimated density, not raw frequencies
    plt.legend(['Non-Bait','Bait'])
    fig.suptitle('Words Per Title', fontsize=24, va='baseline')
    
    fig.tight_layout()

In [None]:
plot_word_number_histogram(df[df['clickbait'] == 0]['title'],
                           df[df['clickbait'] == 1]['title'])

# Tokenization with Sadedegel

#### SadedeGel's main building block is called 'Doc'. It only needs a string as input and turns it into a SadedeGel object for later use.

#### Here we load a sample news article title with SadedeGel and then tokenize its sentences. You can see it handles Turkish syntax pretty well.

#### You can choose among the various tokenizers built into SadedeGel; we went with the default one for this example.

#### Finally, we tokenize the whole dataset and store the tokens in two corpora (bait and non-bait) for later analysis.

In [None]:
# loading sample title

document = Doc(df.iloc[5003]['title'])
document

In [None]:
# tokenizing the sample using sadedegel tokenizer

for sentence in document:
    print(sentence.tokens)

In [None]:
# tokenizing the clickbait data using sadedegel tokenizer

bait = df[df.clickbait==1.0]['title']
bait_corpus = []

for title in tqdm(bait):
    d = Doc(title)
    w = [i.tokens for i in d]
    bait_corpus.append(list(itertools.chain.from_iterable(w)))
bait_corpus=list(itertools.chain.from_iterable(bait_corpus))
bait_corpus=[tr_lower(i) for i in bait_corpus]

In [None]:
# tokenizing the non-bait data using sadedegel tokenizer

no_bait = df[df.clickbait!=1.0]['title']
nb_corpus = []

for title in tqdm(no_bait):
    d = Doc(title)
    w = [i.tokens for i in d]
    nb_corpus.append(list(itertools.chain.from_iterable(w)))
nb_corpus=list(itertools.chain.from_iterable(nb_corpus))
nb_corpus=[tr_lower(i) for i in nb_corpus]
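#### A note on why the cells above call `tr_lower` rather than `str.lower()`: Turkish has a dotted and a dotless "i", and Python's default lower-casing maps "I" to "i" instead of "ı". The function below is a minimal sketch of Turkish-aware lower-casing. `turkish_lower` is a hypothetical name used only here, and this is not SadedeGel's implementation, which may well handle more cases.

```python
def turkish_lower(text: str) -> str:
    """Lower-case text using Turkish rules for the dotted/dotless i pair."""
    # U+0130 is capital dotted I (maps to plain "i"); plain "I" maps to
    # U+0131, the dotless lower-case i. Everything else uses str.lower().
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


# "I\u015eIK" is the upper-case form of the Turkish word for "light".
print("I\u015eIK".lower())          # default rules produce a dotted i
print(turkish_lower("I\u015eIK"))   # Turkish rules keep the dotless i
print(turkish_lower("\u0130STANBUL"))
```

#### Without this step the same word could end up as two different tokens depending on its casing, which would distort the frequency counts later on.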

#### Here we clean our corpora using SadedeGel's default punctuation list, which we extend manually with some tokens (mostly suffixes and very common words) specific to this dataset.

In [None]:
%%time

# filtering out some tokens from bait texts for cleaner results

# loading default puncts list from sadedegel and manually adding some specific terms to filter out
spec = list(puncts)
spec+=['’','…','‘','bir','nin','nın','ın','in','den','dan','ten','tan','ye','ya','e','a','de','da','te','ta']

filtered_tokens = [token for token in bait_corpus if token not in stop_word_list]
b_ht=[]
for i in filtered_tokens:
    if i.startswith('#'):
        b_ht.append(i)


filtered_tokens = [token for token in filtered_tokens if token not in b_ht]
filtered_tokens = [token for token in filtered_tokens if token not in spec]

In [None]:
%%time

# filtering out some tokens from non-bait texts for cleaner results

filtered_tokens_nb = [token for token in nb_corpus if token not in stop_word_list]
nb_ht=[]
for i in filtered_tokens_nb:
    if i.startswith('#'):
        nb_ht.append(i)

filtered_tokens_nb = [token for token in filtered_tokens_nb if token not in spec]
# use the non-bait hashtag list collected above (the bait list b_ht was used here by mistake)
filtered_tokens_nb = [token for token in filtered_tokens_nb if token not in nb_ht]
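#### The three filtering passes above (stop words, hashtags, punctuation and extra tokens) can also be expressed as a single pass with a set lookup, which tends to be faster on large corpora because membership tests on a set do not scan the whole collection. This is only a sketch on toy tokens; `filter_tokens` is a hypothetical helper and the word lists are made up for the example.

```python
def filter_tokens(tokens, stop_words, extra):
    """Drop stop words, hashtags and extra noise tokens in one pass."""
    blocked = set(stop_words) | set(extra)  # set lookup instead of list scans
    return [t for t in tokens if t not in blocked and not t.startswith("#")]


# Toy inputs for illustration only.
toy_tokens = ["son", "dakika", "#sondakika", "ve", "bir", "!", "haberi"]
toy_stop_words = ["ve"]
toy_extra = ["bir", "!"]

print(filter_tokens(toy_tokens, toy_stop_words, toy_extra))
```

#### The result is the same list the step-by-step version would produce, since each token is kept only if it passes all three checks.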

In [None]:
# counting most common bait tokens

counter = Counter(filtered_tokens)
most = counter.most_common()
x_b, y_b = [], []
for word, count in most[:20]:
    x_b.append(word)
    y_b.append(count)

In [None]:
# counting most common non-bait tokens

counter_nb = Counter(filtered_tokens_nb)
most_nb = counter_nb.most_common()
x_nb, y_nb = [], []
for word, count in most_nb[:20]:
    x_nb.append(word)
    y_nb.append(count)
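#### The two counting cells above follow the same pattern, so they could share one small helper. The version below is a sketch of that idea using `Counter.most_common` and `zip`; `top_n` is a hypothetical name and the token list is made up for the example.

```python
from collections import Counter


def top_n(tokens, n):
    """Return the n most frequent tokens and their counts as two lists."""
    pairs = Counter(tokens).most_common(n)
    if not pairs:
        return [], []
    words, counts = zip(*pairs)  # split [(word, count), ...] into two tuples
    return list(words), list(counts)


toy_tokens = ["son", "dakika", "son", "haberi", "son", "dakika"]
words, counts = top_n(toy_tokens, 2)
print(words, counts)
```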

# Most Common Words

#### Here we can see there's a huge difference between clickbait and actual news titles in terms of their most frequent words. Words like "son" and "dakika" are much more common in clickbait titles. Together, "son dakika" means something like "Breaking News" or "Newsflash"...

In [None]:
# plotting most common tokens for bait/non_bait

fig, ax = plt.subplots(1,2,figsize=(18, 6))
sns.barplot(x=y_b, y=x_b, palette='plasma', ax=ax[1])
sns.barplot(x=y_nb, y=x_nb, palette='plasma', ax=ax[0])
ax[0].set_title('Non_Bait')
ax[1].set_title('Bait')
plt.suptitle('Word Counts')
plt.show()

# WordCloud for News Titles

#### Again using the corpora we created with SadedeGel's tokenizer, I wanted to visualize the most common words with WordCloud. This package was created by Andreas Mueller and it's pretty cool!

In [None]:
def plot_wordcloud(text, title, title_size):
    """ A function for creating wordcloud images """
    allwords = text
    mostcommon = FreqDist(allwords).most_common(140)
    wordcloud = WordCloud(
        width=1200,
        height=800,
        background_color='black',
        max_words=150,
        scale=3,        
        contour_width=0.1,
        contour_color='grey',
    ).generate_from_frequencies(dict(mostcommon))  # size each word by its count

    def grey_color_func(word,
                        font_size,
                        position,
                        orientation,
                        random_state=None,
                        **kwargs):
        # A definition for creating grey color shades.
        return 'hsl(0, 0%%, %d%%)' % random.randint(60, 100)

    fig = plt.figure(figsize=(18, 18), facecolor='white')
    plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=42),
               interpolation='bilinear')
    plt.axis('off')
    plt.title(title,
              fontdict={
                  'size': title_size,
                  'verticalalignment': 'bottom'
              })
    plt.tight_layout(pad=0)
    plt.show()

In [None]:
plot_wordcloud(filtered_tokens_nb,
               'Most Common Words in Non-Bait Titles',
               title_size=30)

In [None]:
plot_wordcloud(filtered_tokens,
               'Most Common Words in Bait Titles',
               title_size=30)

# Modelling

#### Finally, classification time! Here we're going to build a model that predicts whether a given title is clickbait.

In [None]:
# loading some packages for modelling

from sklearn.model_selection import cross_validate, StratifiedKFold, cross_val_score, train_test_split
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score, plot_confusion_matrix

from scipy.sparse import vstack, csr_matrix

# Model With TFIDF Embeddings

#### We're going to compute TF-IDF embeddings, which are a sparse representation of text. Again we use SadedeGel's built-in embedding extractors. They're pretty simple to use: we create a Doc object for every news title in our data and take its TF-IDF embeddings as a sparse matrix. We then average the embeddings of the sentences to get a single representation of the news title.

#### In the second part I just convert the sparse matrix to a dataframe to show what we did here. Each row is an observation (a title) and each column is a term in the vocabulary, whose size is 27744 for now...
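#### To make the idea concrete, here is a small pure-Python sketch of TF-IDF weighting followed by averaging sentence vectors into one title vector. It uses the textbook weighting `tf * log(N / df)` computed on the toy sentences themselves. That is an assumption made for illustration: SadedeGel relies on a vocabulary built from its own news corpus and may weight terms differently. `tfidf_vectors` and `mean_vector` are hypothetical helpers.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_sentences, vocabulary):
    """Return one TF-IDF vector (list of floats) per tokenized sentence."""
    n = len(tokenized_sentences)
    # Document frequency: in how many sentences does each token occur?
    df = Counter(tok for sent in tokenized_sentences for tok in set(sent))
    vectors = []
    for sent in tokenized_sentences:
        tf = Counter(sent)
        vectors.append(
            [tf[w] * math.log(n / df[w]) if df[w] else 0.0 for w in vocabulary]
        )
    return vectors


def mean_vector(vectors):
    """Average the vectors column by column into a single vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Two toy "sentences" that together stand in for one title.
sentences = [["son", "dakika", "haberi"], ["son", "hali", "yürek", "burkuyor"]]
vocabulary = sorted({tok for sent in sentences for tok in sent})

title_vector = mean_vector(tfidf_vectors(sentences, vocabulary))
print(dict(zip(vocabulary, title_vector)))
```

#### Note how "son", which occurs in both sentences, gets a weight of zero, while words that occur in only one sentence keep a positive weight. Most entries of a real title vector are zero as well, which is why a sparse matrix is a natural container for them.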

In [None]:
# getting tfidf embeddings using sadedegel library

X_tf = df['title']
X = []

for title in tqdm(X_tf):
    d = Doc(title)
    X.append(csr_matrix(d.tfidf_embeddings.mean(axis=0)))


X = vstack(X)
print('Shape of Embeddings: ', X.shape)

In [None]:
# converting sparse matrix to dataframe for explanatory purposes (not used in training)

sample=pd.DataFrame.sparse.from_spmatrix(X)
sample.sample(5)

In [None]:
# getting target values and train-test splitting for validation

y = df['clickbait']
X_train, X_val, y_train, y_val = train_test_split(X,y,stratify=y, test_size=0.2, random_state=seed)

### Here we choose some basic classifiers to test them...

In [None]:
# Selecting some classifiers:

logreg = LogisticRegression(random_state=seed)

dectree = DecisionTreeClassifier(random_state=seed)

knclass = KNeighborsClassifier()

light = lgb.LGBMClassifier(random_state=seed)

In [None]:
# Setting 5 fold CV:

cv = StratifiedKFold(5, shuffle=True, random_state=seed)
classifiers = [logreg,dectree, knclass, light]
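#### Stratified k-fold splitting keeps the class ratio roughly the same in every fold, so each fold is a fair miniature of the full training set. The sketch below illustrates the idea in plain Python by shuffling the indices of each class and dealing them out to the folds in turn. It is not scikit-learn's implementation, and `stratified_folds` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed=42):
    """Assign every sample index to a fold while preserving the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class out to the folds in turn.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy labels: 10 non-bait (0) and 10 bait (1) samples.
labels = [0] * 10 + [1] * 10
folds = stratified_folds(labels, n_splits=5)
for fold in folds:
    print(sorted(fold), "bait count:", sum(labels[i] for i in fold))
```

#### In cross-validation, each fold is used once as the test set while the remaining folds form the training set.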

In [None]:
def model_check(X, y, classifiers, cv):
    
    ''' A function for testing multiple classifiers and return several metrics. '''
    
    model_table = pd.DataFrame()

    row_index = 0
    for cls in classifiers:

        MLA_name = cls.__class__.__name__
        model_table.loc[row_index, 'Model Name'] = MLA_name
        
        cv_results = cross_validate(
            cls,
            X,
            y,
            cv=cv,
            scoring=('accuracy','f1','roc_auc'),
            return_train_score=True,
            n_jobs=-1
        )
        model_table.loc[row_index, 'Train Roc/AUC Mean'] = cv_results[
            'train_roc_auc'].mean()
        model_table.loc[row_index, 'Test Roc/AUC Mean'] = cv_results[
            'test_roc_auc'].mean()
        model_table.loc[row_index, 'Test Roc/AUC Std'] = cv_results['test_roc_auc'].std()
        model_table.loc[row_index, 'Train Accuracy Mean'] = cv_results[
            'train_accuracy'].mean()
        model_table.loc[row_index, 'Test Accuracy Mean'] = cv_results[
            'test_accuracy'].mean()
        model_table.loc[row_index, 'Test Acc Std'] = cv_results['test_accuracy'].std()
        model_table.loc[row_index, 'Train F1 Mean'] = cv_results[
            'train_f1'].mean()
        model_table.loc[row_index, 'Test F1 Mean'] = cv_results[
            'test_f1'].mean()
        model_table.loc[row_index, 'Test F1 Std'] = cv_results['test_f1'].std()
        model_table.loc[row_index, 'Time'] = cv_results['fit_time'].mean()

        row_index += 1        

    model_table.sort_values(by=['Test F1 Mean'],
                            ascending=False,
                            inplace=True)

    return model_table

## Sadedegel TFIDF Results

#### Alright! The results are here, let's take a look...

#### They look pretty decent! I think the F1 score is the more sensible metric in this case, since we want to keep both false positives and false negatives low. Our top two F1 scorers are the Logistic Regression and LGBM classifiers. On closer inspection, though, LogisticRegression seems to overfit somewhat, judging by the gap between its train and test scores, while the default LGBM looks much better. Let's fit LGBM on our train set and then test it on our validation set!

In [None]:
%%time

raw_models = model_check(X_train, y_train, classifiers, cv)
display(raw_models)

In [None]:
light.fit(X_train, y_train)
y_pred = light.predict(X_val)

In [None]:
# Testing the model on unseen data

print('Accuracy:', accuracy_score(y_val, y_pred))
print('F1:', f1_score(y_val, y_pred))

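#### Validation looks good. Let's carry on with confusion matrices, which make true positives, false positives and so on easier to see. Note that `conf_mat` fits each classifier on the data it is given and plots the matrix for that same data, so the call below shows how the models do on the training set...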

In [None]:
def conf_mat(X,y, classifiers):
    
    ''' A function for displaying confusion matrices'''
    
    fig, axes = plt.subplots(2,2, figsize=(12,8))
    
    axes = axes.flatten()

    for ax, classifier in zip(axes, classifiers):
        classifier.fit(X,y)
        plot_confusion_matrix(classifier, X, y,
                                         values_format = 'n',
                                         display_labels = ['Non_Bait', 'Clickbait'],
                                         cmap='summer_r',ax=ax)
        ax.set_title(f'{classifier.__class__.__name__}')
        ax.grid(False)
        plt.tight_layout()

In [None]:
conf_mat(X_train, y_train, classifiers)
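#### The rates mentioned above can be read off a 2x2 confusion matrix directly. The sketch below computes them from four counts. The numbers are made up for the example and are not results from this notebook, `binary_metrics` is a hypothetical helper, and it assumes none of the denominators is zero.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute common rates from the four cells of a binary confusion matrix."""
    tpr = tp / (tp + fn)          # true positive rate (recall)
    fpr = fp / (fp + tn)          # false positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"tpr": tpr, "fpr": fpr, "precision": precision, "f1": f1}


# Hypothetical counts: 100 non-bait and 100 bait titles.
metrics = binary_metrics(tn=90, fp=10, fn=20, tp=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

#### F1 is the harmonic mean of precision and recall, so it drops whenever either false positives or false negatives grow.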

## Model Results Using Sklearn TFIDF Embeddings for Benchmarking

#### I just wanted to run the same models on sklearn's TF-IDF features to compare them with SadedeGel's embeddings. Looking at the results, SadedeGel's TF-IDF embeddings worked better on these Turkish texts for almost every classifier. Pretty cool!

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
Xi = vectorizer.fit_transform(df.title)

In [None]:
%%time

Xi_train, Xi_val, yi_train, yi_val = train_test_split(Xi,y,stratify=y, test_size=0.2, random_state=seed)

raw_models = model_check(Xi_train, yi_train, classifiers, cv)
display(raw_models)

# Model With Sadedegel BERT Embeddings

#### SadedeGel also comes with another embedding extractor, one that has become popular in NLP tasks lately: 'BERT'. It's a bit slower than computing TF-IDF embeddings, but it gives the models a much stronger signal for making predictions. For performance reasons I take just 2000 random titles from our data and compute BERT embeddings for them.

In [None]:
# randomly sampling 2000 observations

df = df.sample(2000, random_state=seed)  # fixed seed so the sample is reproducible

X_bt = df['title']
X = []

# getting embeddings

for title in tqdm(X_bt):
    d = Doc(title)
    X.append(csr_matrix(d.bert_embeddings.mean(axis=0)))


X = vstack(X)
print('Shape of Embeddings: ', X.shape)

In [None]:
y = df['clickbait']
Xb_train, Xb_val, yb_train, yb_val = train_test_split(X,y,stratify=y, test_size=0.2, random_state=seed)

In [None]:
light.fit(Xb_train, yb_train)
yb_pred = light.predict(Xb_val)

## BERT Embedding Results

#### Oh nice! We used only 10% of the data we used for TF-IDF and got almost the same scores from our classifier. So if you want higher scores and have the computing power, you can go with SadedeGel's BERT embeddings!

In [None]:
# bert results

print('Accuracy:', accuracy_score(yb_val, yb_pred))
print('F1:', f1_score(yb_val, yb_pred))

In [None]:
conf_mat(Xb_train, yb_train, classifiers)

# Testing the Classifier on Randomly Selected News Titles Found on Web

#### Here I gathered some recently posted titles from various news sites. Our classifier outputs 1 for a clickbait title and 0 for an actual news title. The results look promising to me; you can check them yourself or add other titles to the 'titles' list to test them. 

In [None]:
# some random titles from various news sites

titles = ["Öyle bir değişim geçirdi ki",
          "Grip aşısından bir hafta önce az uyumak, aşının etkisini yüzde 50 azaltıyor",
          "Son hali yürek burkuyor",
          "Covid-19 aşısı bulundu",          
          "Yunanistan, Türkiye sınırınında güvenlik önlemlerini artırıyor: Duvar, kameralar ve daha çok sınır muhafızı",
          "Son Dakika | Mesut Özil'den Arsenal açıklaması!",
          "Son dakika haberi: Azerbaycan ordusu Ermenistan'a ağır darbe vurdu! Bir tabur asker..",
          "Türkiye'nin 100 yıllık enerjisini karşılayacak dev rezerv!",
          "Türkiye'nin en yüksek barajında yüzde 87'lik fiziki gerçekleşme sağlandı",
          "Son dakika haberi: Dünya bunu tartışıyor: 'Uzun Kovid' kimleri vuruyor, uzmanlar açıkladı!",
          "Trump'ın vergi kayıtları Çin'le iş bağlantılarını gösteriyor"

]

title_embeds = []
for title in titles:
    d = Doc(title)
    title_embeds.append(csr_matrix(d.tfidf_embeddings.mean(axis=0)))


title_embeds = vstack(title_embeds)

light.fit(X_train, y_train)

preds = light.predict(title_embeds)

In [None]:
d = {'title':titles, 'clickbait':preds}

pd.DataFrame(d)

# Final Words

### That concludes my notebook. I wanted to introduce 'SadedeGel' for Turkish text classification tasks, and I'd say it did pretty well on my first try. I hope you find it useful too.

### Thanks for reading and happy coding all!