# Fake news detection

In [3]:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd

## Preprocessing

We import the fake news dataset. It contains over 70 thousand articles, but for this experiment we use only the first 15 thousand to save time and resources.

The dataset has a `text` field containing the article text and a `label` field with the following values:
- 0: fake
- 1: real

The dataset is available at [this link](https://drive.google.com/file/d/1dLMKyEB3JcP-BP2F2OvqlwS2SD6UOQ8x/view?usp=sharing).

In [3]:
import pandas as pd

df = pd.read_csv('WELFake_Dataset.csv')[:15000]

# Keep a copy of the text column before preprocessing
df['original_text'] = df['text']

Remove the shortest articles (20 words or fewer)

In [4]:
df['text_wc'] = df['text'].apply(lambda x:len(str(x).split()))
df = df[df['text_wc'] > 20]

Keep only the useful columns

In [5]:
df = df[[
    'label',
    'text',
    'original_text'
]]

Remove rows with missing values (NA)

In [6]:
df = df[df['label'].notna()]
df = df[df['text'].notna()]

The next cell applies lemmatization and removes punctuation, stopwords, currency symbols, digits, whitespace tokens (such as tabs) and tokens flagged by `like_num` (for example numbers written out as words). Larger or transformer-based models may improve the result: https://spacy.io/models/en

For preprocessing we use the spaCy English model *en_core_web_md*, which is well suited to tasks such as lemmatization and punctuation removal.

Swifter is used to speed up DataFrame.apply()

In [7]:
import spacy
import swifter

nlp = spacy.load('en_core_web_md')

df['text'] = df.original_text.swifter.apply(lambda text: " ".join([token.lemma_ for token in nlp(text) if 
                                             not token.is_punct 
                                             and not token.is_currency
                                             and not token.is_digit
                                             and not token.is_space
                                             and not token.is_stop
                                             and not token.like_num
                                             ]))

2023-01-10 12:07:04.523048: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.


Pandas Apply:   0%|          | 0/14593 [00:00<?, ?it/s]

Optionally, save the processed dataset so that the whole process does not have to be repeated

In [12]:
df.to_csv('processed_fnews_15k.csv')

In [4]:
df = pd.read_csv('processed_fnews_15k.csv')

### Dataset statistics

In [8]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,label,text,original_text
0,0,1,comment expect Barack Obama member FYF911 FukY...,No comment is expected from Barack Obama Membe...
1,2,1,demonstrator gather night exercise constitutio...,"Now, most of the demonstrators gathered last ..."
2,3,0,dozen politically active pastor come private d...,A dozen politically active pastors came here f...
3,4,1,rs-28 Sarmat missile dub Satan replace SS-18 f...,"The RS-28 Sarmat missile, dubbed Satan 2, will..."
4,5,1,s time sue Southern Poverty Law Center!On Tues...,All we can say on this one is it s about time ...
5,8,1,owner Ringling Bar locate south White Sulphur ...,"The owner of the Ringling Bar, located south o..."
6,9,1,file Sept. file photo marker welcome commuter ...,"FILE – In this Sept. 15, 2005 file photo, the ..."
7,10,1,punchable alt right Nazi internet get thorough...,The most punchable Alt-Right Nazi on the inter...
8,11,0,BRUSSELS Reuters british Prime Minister Theres...,BRUSSELS (Reuters) - British Prime Minister Th...
9,12,0,WASHINGTON Reuters Charles Schumer Democrat U....,"WASHINGTON (Reuters) - Charles Schumer, the to..."


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14593 entries, 0 to 14592
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     14593 non-null  int64 
 1   label          14593 non-null  int64 
 2   text           14593 non-null  object
 3   original_text  14593 non-null  object
dtypes: int64(2), object(2)
memory usage: 456.2+ KB


In [10]:
df.label.value_counts()

1    7497
0    7096
Name: label, dtype: int64

## Data preparation

Convert the text into sequences of integers

In [5]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(df['text'].tolist())
sequences = tokenizer.texts_to_sequences(df['text'].tolist())
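
To make the integer encoding concrete, here is a minimal pure-Python sketch of the idea behind `Tokenizer`: words are ranked by frequency, the most frequent word gets index 1, and index 0 is left unused so that it can later serve as padding. The helper names `build_word_index` and `encode` and the toy texts are hypothetical. The sketch only lowercases and splits on whitespace, whereas the Keras class also filters punctuation and offers further options.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank; 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; index 0 stays reserved for padding
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, word_index):
    """Replace every known word with its index and drop unknown words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Toy corpus, only for illustration
toy_texts = ["senate vote bill", "senate reject bill bill"]
toy_index = build_word_index(toy_texts)
toy_sequences = encode(toy_texts, toy_index)
print(toy_index)
print(toy_sequences)
```

Because 0 is never assigned to a word, the vocabulary size used by the embedding layer is the number of words plus one.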

Set the parameters for the embedding layer

In [6]:
max_length = max([len(seq) for seq in sequences])
vocab_size = len(tokenizer.word_index) + 1

Add padding at the end so that all sequences have the same length

In [7]:
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding='post', maxlen=max_length)
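
The effect of post-padding can be sketched in plain Python. The function `pad_post` below is a hypothetical stand-in, not the Keras implementation: it appends zeros up to `maxlen` and, for sequences that are too long, keeps the last `maxlen` items, which mirrors the default `truncating='pre'` behaviour of `pad_sequences`. That second case matters when a new article is longer than the longest training article.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]   # drop items from the front
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Toy sequences, only for illustration
toy_sequences = [[2, 3, 1], [2, 4, 1, 1], [5]]
toy_max_length = max(len(seq) for seq in toy_sequences)
print(pad_post(toy_sequences, toy_max_length))
print(pad_post(toy_sequences, 2))
```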

Split the dataset into train, test and validation sets

In [8]:
x_train, x_test, y_train, y_test = train_test_split(padded_sequences, df.label, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.18)
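
Since the second split is applied to the 80% left over from the first, the validation share of the whole dataset is 0.18 × 0.8 ≈ 14.4%, which leaves about 65.6% for training. The sketch below works out the row counts for the 14593 articles in this notebook. It assumes that scikit-learn rounds the held-out portion up to the next integer, and the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size=0.2, val_size=0.18):
    """Return (train, validation, test) row counts for two chained splits."""
    n_test = math.ceil(test_size * n_samples)    # first split: hold out the test set
    n_remaining = n_samples - n_test
    n_val = math.ceil(val_size * n_remaining)    # second split: validation from the rest
    return n_remaining - n_val, n_val, n_test

n_train, n_val, n_test = split_sizes(14593)
print(n_train, n_val, n_test)
print(round(n_train / 14593, 3), round(n_val / 14593, 3), round(n_test / 14593, 3))
```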

## Model creation

In [11]:
model = tf.keras.Sequential()

# Embedding layer: maps each word index to a 100-dimensional vector
model.add(layers.Embedding(input_dim=vocab_size, output_dim=100, input_length=max_length))

# Convolutional layer: 128 filters, kernel_size = 5
model.add(layers.Conv1D(128, 5, activation='relu'))

# Global max pooling layer: keeps the maximum of each filter over the whole sequence
model.add(tf.keras.layers.GlobalMaxPooling1D())

# Dropout layer to reduce overfitting
model.add(tf.keras.layers.Dropout(0.5))

# Flatten layer: has no effect here, because the pooled output is already one vector per example
model.add(layers.Flatten())

# Dense layer with 32 units and a ReLU activation function
model.add(layers.Dense(32, activation='relu'))

# Output layer with one unit per class (0 = fake, 1 = real) and a sigmoid activation function;
# softmax is the more conventional pairing with sparse_categorical_crossentropy
model.add(layers.Dense(2, activation='sigmoid'))
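
As a rough check on the architecture, the following sketch traces the per-example output shape through each layer and counts the trainable parameters. The helper names `cnn_shapes` and `trainable_params` are hypothetical, and `seq_len=1000` and `vocab_size=50_000` are placeholder values, not the ones computed in this notebook.

```python
def cnn_shapes(seq_len, embed_dim=100, filters=128, kernel_size=5):
    """Per-example output shape after each layer of the model."""
    return [
        ("Embedding", (seq_len, embed_dim)),
        ("Conv1D", (seq_len - kernel_size + 1, filters)),   # 'valid' padding, stride 1
        ("GlobalMaxPooling1D", (filters,)),                  # max over the sequence axis
        ("Dropout", (filters,)),
        ("Flatten", (filters,)),                             # already flat: no change
        ("Dense(32)", (32,)),
        ("Dense(2)", (2,)),
    ]

def trainable_params(vocab_size, embed_dim=100, filters=128, kernel_size=5):
    """Weights plus biases for each layer that has parameters."""
    return {
        "Embedding": vocab_size * embed_dim,
        "Conv1D": kernel_size * embed_dim * filters + filters,
        "Dense(32)": filters * 32 + 32,
        "Dense(2)": 32 * 2 + 2,
    }

# Placeholder sizes, only for illustration
for name, shape in cnn_shapes(seq_len=1000):
    print(f"{name:20s} {shape}")
print(trainable_params(vocab_size=50_000))
```

With a large vocabulary the embedding table holds almost all of the parameters, while the convolution and dense layers stay small regardless of sequence length.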

In [12]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
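
The loss named here takes integer labels (0 or 1) rather than one-hot vectors and penalizes the negative log of the probability assigned to the true class. Below is a minimal sketch of that formula for a single example. The rescaling step is a simplifying assumption of this sketch, added because two sigmoid outputs need not sum to one; it is not a description of Keras internals.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Negative log of the probability assigned to the true class."""
    total = sum(probs)
    normalized = [p / total for p in probs]   # rescale so the scores sum to 1
    return -math.log(normalized[label])

# Hypothetical scores for one article: 10% fake, 90% real
print(sparse_categorical_crossentropy([0.1, 0.9], label=1))   # confident and correct: small loss
print(sparse_categorical_crossentropy([0.1, 0.9], label=0))   # confident and wrong: large loss
```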

## Training

In [13]:
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_data=(x_val, y_val))

Epoch 1/5


2023-01-10 12:46:04.026340: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 795203472 exceeds 10% of free system memory.
2023-01-10 12:46:07.692287: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8100
2023-01-10 12:46:09.055097: W tensorflow/tsl/framework/bfc_allocator.cc:290] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.49GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2023-01-10 12:46:09.055184: W tensorflow/tsl/framework/bfc_allocator.cc:290] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.49GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2023-01-10 12:46:10.417247: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x250d6590 initialized for platform CUDA (this

Epoch 2/5


2023-01-10 12:47:00.027952: W tensorflow/tsl/framework/bfc_allocator.cc:290] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.10GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2023-01-10 12:47:00.030964: W tensorflow/tsl/framework/bfc_allocator.cc:290] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.10GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.


Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7ff2f9bde910>

## Evaluation

In [24]:
test_loss, test_acc = model.evaluate(x_test, y_test)
print('Test accuracy:', test_acc)

Test accuracy: 0.9554641842842102
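
Accuracy alone does not show whether the errors are fake articles labelled as real or the reverse. A possible complement is to count the four outcomes and derive precision and recall for the fake class. The helper `binary_report` and the toy labels below are hypothetical and are not results from this notebook.

```python
def binary_report(y_true, y_pred, positive=0):
    """Confusion counts plus precision and recall for the `positive` class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

# Hypothetical labels and predictions, only to show the call (0 = fake, 1 = real)
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 1, 0, 1, 0]
print(binary_report(toy_true, toy_pred))
```

In the notebook, the predicted classes could be obtained with something like `np.argmax(model.predict(x_test), axis=1)` and passed in together with `y_test`.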


## Classifying new articles

In [25]:
sample_text = 'Luigi, Italian Politician, Turns into Giant Tomato \
In a bizarre turn of events, local Italian politician Luigi has transformed into a giant tomato. \
Eyewitnesses reported that during a heated debate in the town hall, Luigi suddenly began to grow and change color, \
until he was a fully ripe tomato, standing at least 15 feet tall. \
Despite his new form, Luigi is said to be in good spirits and continues to conduct official business as normal, \
holding meetings and signing documents with his newly formed tomato vines. His constituents are in shock but also delight, \
they have never seen something like this before.'

In [26]:
import spacy
import swifter
import numpy as np

nlp = spacy.load('en_core_web_md')

sample_text = " ".join([token.lemma_ for token in nlp(sample_text) if 
                                             not token.is_punct 
                                             and not token.is_currency
                                             and not token.is_digit
                                             and not token.is_space
                                             and not token.is_stop
                                             and not token.like_num
                                             ])

sequences = tokenizer.texts_to_sequences([sample_text])
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding='post', maxlen=max_length)

predictions = model.predict(padded_sequences)
prediction_npa = np.asarray(predictions[0])
predicted_class = np.argmax(prediction_npa)

if predicted_class == 0:
    print('[ FAKE ]')
else:
    print('[ REAL ]')

[ FAKE ]
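
The final decoding step can also report how strongly the model favours the chosen class. The sketch below is one way to do that under stated assumptions: `LABEL_NAMES` and `decode_prediction` are hypothetical names, and the scores are made up rather than taken from the model above.

```python
LABEL_NAMES = {0: "FAKE", 1: "REAL"}   # same convention as the dataset labels

def decode_prediction(scores):
    """Return the label name and the share of the total score it received."""
    best = max(range(len(scores)), key=lambda i: scores[i])   # argmax
    total = sum(scores)
    share = scores[best] / total if total else 0.0
    return LABEL_NAMES[best], share

# Hypothetical model output for one article, not a value produced by this notebook
label, share = decode_prediction([0.92, 0.08])
print(f"[ {label} ] ({share:.0%} of the total score)")
```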
