# Fake or not? - DistilBERT

This notebook is also available [on Kaggle](https://www.kaggle.com/code/algord/fake-or-not-distilbert), where new versions are posted more often.

In this notebook, I try to classify [news data](https://www.kaggle.com/datasets/algord/fake-news) as fake or real. This is a preliminary stage of my research on predicting the spread of news and information.

The model built here is intended to serve as a baseline for comparison with other algorithms.

In [1]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [3]:
df = pd.read_csv('../input/fake-news/FakeNewsNet.csv')
df.head()

| Unnamed: 0 | title | news_url | source_domain | tweet_num | real |
|---|---|---|---|---|---|
| 0 | Kandi Burruss Explodes Over Rape Accusation on... | http://toofab.com/2017/05/08/real-housewives-a... | toofab.com | 42 | 1 |
| 1 | People's Choice Awards 2018: The best red carp... | https://www.today.com/style/see-people-s-choic... | www.today.com | 0 | 1 |
| 2 | Sophia Bush Sends Sweet Birthday Message to 'O... | https://www.etonline.com/news/220806_sophia_bu... | www.etonline.com | 63 | 1 |
| 3 | Colombian singer Maluma sparks rumours of inap... | https://www.dailymail.co.uk/news/article-33655... | www.dailymail.co.uk | 20 | 1 |
| 4 | Gossip Girl 10 Years Later: How Upper East Sid... | https://www.zerchoo.com/entertainment/gossip-g... | www.zerchoo.com | 38 | 1 |


In [4]:
df = df[:2000]

In [5]:
df['real'].value_counts()

1    1521
0     479
Name: real, dtype: int64
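
The counts show that this 2,000-row sample is imbalanced: roughly three quarters of the titles are labeled real. As a rough reference point, and assuming nothing beyond the counts printed above, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
class_counts = {1: 1521, 0: 479}

total = sum(class_counts.values())
# Label with the largest count (here: 1, i.e. "real")
majority_label = max(class_counts, key=class_counts.get)
# Accuracy of always predicting the majority label on this sample
majority_baseline = class_counts[majority_label] / total

print(f"majority label: {majority_label}, baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy reported on this sample is best read against that figure; the class proportions in a random test split may differ slightly.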

In [6]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights).to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
tokenized = df['title'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [8]:
# Pad every sequence with 0 (the [PAD] token ID) up to the length of the longest title
max_len = max(len(ids) for ids in tokenized.values)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])

In [9]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 50)
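
To make the padding and masking steps concrete, here is a minimal sketch on two made-up token-ID sequences. The IDs are hypothetical stand-ins rather than real tokenizer output; the only assumption carried over from the notebook is that 0 is the padding ID, so that `!= 0` separates real tokens from padding.

```python
import numpy as np

# Two toy sequences of different length (hypothetical IDs, not real tokenizer output)
toy_tokenized = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]

# Right-pad with 0 up to the longest sequence, as in the cell above
toy_max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = np.array([ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized])

# 1 marks a real token, 0 marks padding
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded)
print(toy_mask)
```

The mask has the same shape as the padded array, which is why the real data yields a mask of shape `(2000, 50)`: 2,000 titles, each padded to 50 tokens.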

In [10]:
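# Move the padded token IDs and the attention mask to the model's device
input_ids = torch.tensor(padded).to(device)
mask = torch.tensor(attention_mask).to(device)

# Pass the mask so that self-attention ignores the padding positions
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=mask)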

In [11]:
features = last_hidden_states[0][:,0,:].cpu().numpy()
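
The indexing in this cell keeps the hidden state of the first token of every sequence, which is the `[CLS]` token that the tokenizer prepends when `add_special_tokens=True`. The shape-only sketch below uses a zero-filled NumPy array as a stand-in for the model output; the sizes (4 sequences, 50 tokens, hidden size 768) are assumptions chosen for illustration.

```python
import numpy as np

# Stand-in for last_hidden_states[0]: (batch, sequence length, hidden size)
toy_hidden = np.zeros((4, 50, 768))

# Keep position 0 of every sequence, i.e. one vector per title
toy_features = toy_hidden[:, 0, :]
print(toy_features.shape)  # (4, 768)
```

Each title is therefore represented by a single fixed-length vector, which is the input that the logistic regression receives.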

In [12]:
labels = df['real']
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

In [13]:
lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(train_features, train_labels)

LogisticRegression(max_iter=1000)

In [14]:
lr_clf.score(test_features, test_labels)

0.82
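
This score comes from a single random split (`train_test_split` is called without a `random_state`), so it will vary from run to run. One possible way to get a steadier estimate is the `cross_val_score` function imported in the first cell. The sketch below is not the notebook's method: it runs on synthetic stand-ins for `features` and `labels`, and the fold count and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-ins for `features` and `labels` (not the real data)
rng = np.random.default_rng(0)
toy_labels = np.array([1] * 150 + [0] * 50)
toy_features = rng.normal(size=(200, 16)) + 0.5 * toy_labels[:, None]

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), toy_features, toy_labels, cv=cv)

print(f"mean accuracy: {scores.mean():.3f}, std: {scores.std():.3f}")
```

With the real data, `toy_features` and `toy_labels` would be replaced by `features` and `labels`; the spread across folds indicates how much a single-split score can move.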