# Fake News Filter
The fake news problem has grown rapidly over the past decade, driven by the spread of social networks. The United States government has decided to act and has commissioned your company to build a Chrome plug-in that can recognize whether a news item is false. Your task is to build the model that detects fake news, which the team of machine learning engineers and web developers will then put into production. You are given two collections of news articles, one containing only fake news and the other containing only real news; use them to train your model.

### [Link to the dataset](https://proai-datasets.s3.eu-west-3.amazonaws.com/fake_news.zip)

Start with a careful analysis, answering questions such as:
- Is fake news more frequent in a particular category?
- Within each category, are some topics more prone to fake news?
- Do the titles of fake news show any patterns?

Once the model is trained, export it [using pickle](https://scikit-learn.org/stable/model_persistence.html) so that your colleagues can put it into production.
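A minimal sketch of that export step, assuming the final model is a scikit-learn pipeline. The toy texts, the labels and the file name `fake_news_model.pkl` are placeholders for illustration and do not come from the dataset.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data standing in for the real articles (hypothetical examples)
texts = [
    "shocking secret the media will not tell you",
    "you will not believe this shocking video",
    "senate votes on budget bill on tuesday",
    "central bank keeps interest rates unchanged",
]
labels = ["fake", "fake", "true", "true"]

# Vectorizer and classifier travel together in a single pipeline object
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Export the fitted pipeline (a temporary directory is used for this sketch)
model_path = os.path.join(tempfile.gettempdir(), "fake_news_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# What the colleagues would do on their side
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = list(restored.predict(texts)) == list(model.predict(texts))
print(same_predictions)
```

Pickling the whole pipeline, rather than the classifier alone, keeps the fitted vocabulary together with the model. The file should be loaded with the same scikit-learn version that produced it.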

## Importing the datasets

In [None]:
!wget https://proai-datasets.s3.eu-west-3.amazonaws.com/fake_news.zip
!unzip fake_news.zip

--2024-06-11 18:53:32--  https://proai-datasets.s3.eu-west-3.amazonaws.com/fake_news.zip
Resolving proai-datasets.s3.eu-west-3.amazonaws.com (proai-datasets.s3.eu-west-3.amazonaws.com)... 3.5.225.182, 16.12.18.38
Connecting to proai-datasets.s3.eu-west-3.amazonaws.com (proai-datasets.s3.eu-west-3.amazonaws.com)|3.5.225.182|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42975911 (41M) [application/zip]
Saving to: ‘fake_news.zip.1’


2024-06-11 18:53:34 (20.7 MB/s) - ‘fake_news.zip.1’ saved [42975911/42975911]

Archive:  fake_news.zip
replace Fake.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: Fake.csv                
replace True.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: True.csv                


# Preprocessing

In [None]:
import pandas as pd

In [None]:
df_true = pd.read_csv("True.csv")
df_true.head()
len(df_true)

21417

In [None]:
df_fake = pd.read_csv("Fake.csv")
df_fake.head()
len(df_fake)

23481

In [None]:
# add a label column to both datasets: 0 = real, 1 = fake
df_true['flag'] = 0
df_fake['flag'] = 1
df_fake.head()


Unnamed: 0,title,text,subject,date,flag
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",1


In [None]:
# merge the two datasets and add a new unique id column

df_merged = pd.concat([df_true, df_fake])
df_merged['id'] = range(len(df_merged))
df_merged.tail()

Unnamed: 0,title,text,subject,date,flag,id
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",1,44893
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",1,44894
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",1,44895
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",1,44896
23480,10 U.S. Navy Sailors Held by Iranian Military ...,21st Century Wire says As 21WIRE predicted in ...,Middle-east,"January 12, 2016",1,44897
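The output above shows the fake rows keeping their original row labels (23476 to 23480) next to the new `id` column. As an alternative sketch, `pd.concat` can renumber the rows itself through `ignore_index=True`. The two small frames below are hypothetical stand-ins for `True.csv` and `Fake.csv`.

```python
import pandas as pd

# Tiny stand-ins for True.csv and Fake.csv (hypothetical rows)
df_true = pd.DataFrame({"title": ["real a", "real b"]})
df_fake = pd.DataFrame({"title": ["fake a", "fake b", "fake c"]})

df_true["flag"] = 0  # 0 = real
df_fake["flag"] = 1  # 1 = fake

# ignore_index=True discards the per-file row labels and renumbers from 0
df_merged = pd.concat([df_true, df_fake], ignore_index=True)

print(df_merged.index.tolist())
print(df_merged["flag"].value_counts().to_dict())
```

With a unique index, label-based selections such as `df_merged.loc[3]` return a single row instead of one row per source file.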


In [None]:
# shuffle the rows

df_shuffled = df_merged.sample(frac=1)
df_shuffled.head(10)

Unnamed: 0,title,text,subject,date,flag,id
7048,Trump calls Green Party vote recount request a...,"WEST PALM BEACH, Fla. (Reuters) - U.S. Preside...",politicsNews,"November 26, 2016",0,7048
7539,"Secret Service Agents Jump On Stage, Surround...",Donald Trump was briefly surrounded by Secret ...,News,"March 12, 2016",1,28956
14067,"U.S. strikes on Taliban opium labs won't work,...","LASHKAR GAH, Afghanistan/KABUL (Reuters) - As ...",worldnews,"November 23, 2017",0,14067
21373,EU citizens leaving UK pushes down net migrati...,LONDON (Reuters) - Net migration to Britain fe...,worldnews,"August 24, 2017",0,21373
8711,Tim Allen Cracks A Joke About Obama – Would B...,Self-proclaimed fiscal-conservative Tim Alle...,News,"January 17, 2016",1,30128
2221,Senator Grassley not expecting imminent Suprem...,WASHINGTON (Reuters) - The head of the U.S. Se...,politicsNews,"August 11, 2017",0,2221
12298,India PM Modi's party seen sweeping state poll...,NEW DELHI (Reuters) - Indian Prime Minister Na...,worldnews,"December 14, 2017",0,12298
3506,U.S. did not forewarn EU on climate deal: spok...,BRUSSELS (Reuters) - The European Union’s exec...,politicsNews,"May 31, 2017",0,3506
1559,Sally Yates Just Opened A Can Of Constitution...,During testimony in front of the Senate Intell...,News,"May 8, 2017",1,22976
4452,House won't vote on healthcare law before brea...,WASHINGTON (Reuters) - The U.S. House of Repre...,politicsNews,"April 5, 2017",0,4452
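Because `sample(frac=1)` draws a new permutation on every run, the row order shown above will differ between executions. A small sketch of a reproducible shuffle follows; the seed value 42 is an arbitrary choice made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"id": range(10)})

# random_state is an arbitrary seed chosen for this illustration
shuffled_a = df.sample(frac=1, random_state=42)
shuffled_b = df.sample(frac=1, random_state=42)

# reset_index(drop=True) gives the shuffled rows a clean 0..n-1 index
shuffled_clean = shuffled_a.reset_index(drop=True)

print(shuffled_a["id"].tolist() == shuffled_b["id"].tolist())
```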


In [None]:
# download the nltk packages
#!python -m nltk.downloader all

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

In [None]:
# Preprocessing function: lowercases the text, removes stop words, punctuation,
# empty tokens, unwanted tokens and tokens shorter than 3 characters, then lemmatizes

import re
import string

import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# NLTK stop word list (the function below relies on spaCy's own is_stop flag)
english_stopwords = stopwords.words('english')
nlp = spacy.load('en_core_web_sm')

# Specific tokens to remove
unwanted_tokens = {"'s", "-", "—", "–"}


def preprocess(text):
    doc = nlp(text.lower())  # lowercase the text before parsing it
    filtered_tokens = [
        token.lemma_
        for token in doc
        if not token.is_stop
        and not token.is_punct
        and token.text.strip()
        and token.text not in unwanted_tokens
        and len(token.text) >= 3
    ]
    return filtered_tokens



In [None]:
# apply the preprocessing function to the titles
df_shuffled['tokenized_title'] = df_shuffled['title'].apply(preprocess)

In [None]:
df_shuffled

Unnamed: 0,title,text,subject,date,flag,id,tokenized_title
7048,Trump calls Green Party vote recount request a...,"WEST PALM BEACH, Fla. (Reuters) - U.S. Preside...",politicsNews,"November 26, 2016",0,7048,"[trump, call, green, party, vote, recount, req..."
7539,"Secret Service Agents Jump On Stage, Surround...",Donald Trump was briefly surrounded by Secret ...,News,"March 12, 2016",1,28956,"[secret, service, agent, jump, stage, surround..."
14067,"U.S. strikes on Taliban opium labs won't work,...","LASHKAR GAH, Afghanistan/KABUL (Reuters) - As ...",worldnews,"November 23, 2017",0,14067,"[u.s, strike, taliban, opium, lab, work, afgha..."
21373,EU citizens leaving UK pushes down net migrati...,LONDON (Reuters) - Net migration to Britain fe...,worldnews,"August 24, 2017",0,21373,"[citizen, leave, push, net, migration, brexit,..."
8711,Tim Allen Cracks A Joke About Obama – Would B...,Self-proclaimed fiscal-conservative Tim Alle...,News,"January 17, 2016",1,30128,"[tim, allen, crack, joke, obama, destroy, potu..."
...,...,...,...,...,...,...,...
16417,HOW DID THE FBI NOT REPORT THIS? Devastating F...,Former NSA officer John Schindler reports on a...,Government News,"Sep 6, 2016",1,37834,"[fbi, report, devastating, fact, mention, fbi,..."
3655,U.S. Supreme Court leaves key campaign finance...,WASHINGTON (Reuters) - The U.S. Supreme Court ...,politicsNews,"May 22, 2017",0,3655,"[u.s, supreme, court, leave, key, campaign, fi..."
21021,Voice of triumph or doom: North Korean present...,SEOUL (Reuters) - Wearing a pink Korean dress ...,worldnews,"September 4, 2017",0,21021,"[voice, triumph, doom, north, korean, presente..."
12542,BREAKING: HILLARY’S CAMPAIGN CHAIRMAN On Close...,It s good to know Hillary s Campaign Chairman ...,politics,"Nov 1, 2016",1,33959,"[break, hillary, campaign, chairman, close, fr..."


In [None]:
# apply the preprocessing function to the article bodies
df_shuffled['tokenized_text'] = df_shuffled['text'].apply(preprocess)

In [None]:
df_shuffled

Unnamed: 0,title,text,subject,date,flag,id,tokenized_title,tokenized_text
7048,Trump calls Green Party vote recount request a...,"WEST PALM BEACH, Fla. (Reuters) - U.S. Preside...",politicsNews,"November 26, 2016",0,7048,"[trump, call, green, party, vote, recount, req...","[west, palm, beach, fla, reuters, u.s, preside..."
7539,"Secret Service Agents Jump On Stage, Surround...",Donald Trump was briefly surrounded by Secret ...,News,"March 12, 2016",1,28956,"[secret, service, agent, jump, stage, surround...","[donald, trump, briefly, surround, secret, ser..."
14067,"U.S. strikes on Taliban opium labs won't work,...","LASHKAR GAH, Afghanistan/KABUL (Reuters) - As ...",worldnews,"November 23, 2017",0,14067,"[u.s, strike, taliban, opium, lab, work, afgha...","[lashkar, gah, afghanistan, kabul, reuters, u...."
21373,EU citizens leaving UK pushes down net migrati...,LONDON (Reuters) - Net migration to Britain fe...,worldnews,"August 24, 2017",0,21373,"[citizen, leave, push, net, migration, brexit,...","[london, reuters, net, migration, britain, fal..."
8711,Tim Allen Cracks A Joke About Obama – Would B...,Self-proclaimed fiscal-conservative Tim Alle...,News,"January 17, 2016",1,30128,"[tim, allen, crack, joke, obama, destroy, potu...","[self, proclaim, fiscal, conservative, tim, al..."
...,...,...,...,...,...,...,...,...
16417,HOW DID THE FBI NOT REPORT THIS? Devastating F...,Former NSA officer John Schindler reports on a...,Government News,"Sep 6, 2016",1,37834,"[fbi, report, devastating, fact, mention, fbi,...","[nsa, officer, john, schindler, report, devast..."
3655,U.S. Supreme Court leaves key campaign finance...,WASHINGTON (Reuters) - The U.S. Supreme Court ...,politicsNews,"May 22, 2017",0,3655,"[u.s, supreme, court, leave, key, campaign, fi...","[washington, reuters, u.s, supreme, court, mon..."
21021,Voice of triumph or doom: North Korean present...,SEOUL (Reuters) - Wearing a pink Korean dress ...,worldnews,"September 4, 2017",0,21021,"[voice, triumph, doom, north, korean, presente...","[seoul, reuters, wear, pink, korean, dress, fl..."
12542,BREAKING: HILLARY’S CAMPAIGN CHAIRMAN On Close...,It s good to know Hillary s Campaign Chairman ...,politics,"Nov 1, 2016",1,33959,"[break, hillary, campaign, chairman, close, fr...","[good, know, hillary, campaign, chairman, extr..."


In [None]:
# save the preprocessed data to Google Drive

from google.colab import drive
drive.mount('drive')

df_shuffled.to_csv('/content/drive/My Drive/nlp-cleaned.csv')

Mounted at drive


In [None]:
# load the preprocessed data (note: read from the 'Colab Notebooks' folder, not the path used when saving)

file_path = '/content/drive/My Drive/Colab Notebooks/nlp-cleaned.csv'
df_cleaned = pd.read_csv(file_path)
df_cleaned

Unnamed: 0.1,Unnamed: 0,title,text,subject,date,flag,id,tokenized_title,tokenized_text
0,7048,Trump calls Green Party vote recount request a...,"WEST PALM BEACH, Fla. (Reuters) - U.S. Preside...",politicsNews,"November 26, 2016",0,7048,"['trump', 'call', 'green', 'party', 'vote', 'r...","['west', 'palm', 'beach', 'fla', 'reuters', 'u..."
1,7539,"Secret Service Agents Jump On Stage, Surround...",Donald Trump was briefly surrounded by Secret ...,News,"March 12, 2016",1,28956,"['secret', 'service', 'agent', 'jump', 'stage'...","['donald', 'trump', 'briefly', 'surround', 'se..."
2,14067,"U.S. strikes on Taliban opium labs won't work,...","LASHKAR GAH, Afghanistan/KABUL (Reuters) - As ...",worldnews,"November 23, 2017",0,14067,"['u.s', 'strike', 'taliban', 'opium', 'lab', '...","['lashkar', 'gah', 'afghanistan', 'kabul', 're..."
3,21373,EU citizens leaving UK pushes down net migrati...,LONDON (Reuters) - Net migration to Britain fe...,worldnews,"August 24, 2017",0,21373,"['citizen', 'leave', 'push', 'net', 'migration...","['london', 'reuters', 'net', 'migration', 'bri..."
4,8711,Tim Allen Cracks A Joke About Obama – Would B...,Self-proclaimed fiscal-conservative Tim Alle...,News,"January 17, 2016",1,30128,"['tim', 'allen', 'crack', 'joke', 'obama', 'de...","['self', 'proclaim', 'fiscal', 'conservative',..."
...,...,...,...,...,...,...,...,...,...
44893,16417,HOW DID THE FBI NOT REPORT THIS? Devastating F...,Former NSA officer John Schindler reports on a...,Government News,"Sep 6, 2016",1,37834,"['fbi', 'report', 'devastating', 'fact', 'ment...","['nsa', 'officer', 'john', 'schindler', 'repor..."
44894,3655,U.S. Supreme Court leaves key campaign finance...,WASHINGTON (Reuters) - The U.S. Supreme Court ...,politicsNews,"May 22, 2017",0,3655,"['u.s', 'supreme', 'court', 'leave', 'key', 'c...","['washington', 'reuters', 'u.s', 'supreme', 'c..."
44895,21021,Voice of triumph or doom: North Korean present...,SEOUL (Reuters) - Wearing a pink Korean dress ...,worldnews,"September 4, 2017",0,21021,"['voice', 'triumph', 'doom', 'north', 'korean'...","['seoul', 'reuters', 'wear', 'pink', 'korean',..."
44896,12542,BREAKING: HILLARY’S CAMPAIGN CHAIRMAN On Close...,It s good to know Hillary s Campaign Chairman ...,politics,"Nov 1, 2016",1,33959,"['break', 'hillary', 'campaign', 'chairman', '...","['good', 'know', 'hillary', 'campaign', 'chair..."
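After the CSV round trip, the token columns come back as strings such as `"['trump', 'call', ...]"` rather than Python lists, as the quoted entries in the output above show. A minimal sketch of restoring the lists at load time, using the `converters` argument of `pd.read_csv` together with `ast.literal_eval`; the in-memory buffer and the two sample rows are stand-ins for the file on Drive.

```python
import ast
import io

import pandas as pd

# A frame with a list-valued column, like tokenized_title
df = pd.DataFrame({
    "id": [0, 1],
    "tokenized_title": [["trump", "call"], ["secret", "service"]],
})

# Round trip through CSV (an in-memory buffer stands in for the Drive file);
# index=False also avoids the extra "Unnamed: 0" column on reload
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Plain reload: the lists come back as strings
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Reload with a converter: each cell is parsed back into a list
buffer.seek(0)
as_lists = pd.read_csv(buffer, converters={"tokenized_title": ast.literal_eval})

print(type(as_text["tokenized_title"][0]), type(as_lists["tokenized_title"][0]))
```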


## Question 1: is fake news more frequent in a particular category?

In [None]:
df_filtered = df_cleaned[df_cleaned['flag'] == 1]
value_counts = df_filtered['subject'].value_counts()
total_occurrences = len(df_filtered)
relative_frequencies = value_counts / total_occurrences
for value, frequency in relative_frequencies.items():
    print(f"Category: {value}, relative frequency: {frequency:.4f}")

Category: News, relative frequency: 0.3854
Category: politics, relative frequency: 0.2913
Category: left-news, relative frequency: 0.1899
Category: Government News, relative frequency: 0.0669
Category: US_News, relative frequency: 0.0333
Category: Middle-east, relative frequency: 0.0331


### Among the fake articles, News is the largest category (about 38.5%), followed by politics (29.1%) and left-news (19.0%)
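The figures above describe how the fake articles are distributed across categories. A complementary view is the share of fake articles inside each category, sketched below on a hypothetical miniature of the merged dataset. In the rows displayed in this notebook the real articles carry subjects such as `politicsNews` and `worldnews` while the fake ones carry `News`, `politics`, `left-news` and so on, so it is worth checking whether that per-category share is informative at all.

```python
import pandas as pd

# Hypothetical miniature of the merged dataset (flag: 0 = real, 1 = fake)
df = pd.DataFrame({
    "subject": ["News", "News", "politics", "worldnews", "worldnews", "politics"],
    "flag": [1, 1, 1, 0, 0, 0],
})

# Distribution of the fake articles across subjects
fake_distribution = df.loc[df["flag"] == 1, "subject"].value_counts(normalize=True)

# Share of fake articles inside each subject (mean of a 0/1 column)
fake_share = df.groupby("subject")["flag"].mean()

print(fake_distribution)
print(fake_share)
```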

## Question 2: within each category, are some topics more prone to fake news?

In [None]:
# analyze the News category to look for its topics

df_test_News=df_filtered[df_filtered['subject'] == 'News']
documents_news=df_test_News['tokenized_title']+ " " + df_test_News['tokenized_text']
documents_news=(list(documents_news))


In [None]:
# build the dictionary for this category

import gensim.corpora as corpora
import gensim

tokenized_doc_news = [doc.split() for doc in documents_news]

id2word=corpora.Dictionary(tokenized_doc_news)

corpus_news=[id2word.doc2bow(text)for text in tokenized_doc_news]

from pprint import pprint

num_topics=5

lda_model_news=gensim.models.LdaMulticore(corpus=corpus_news,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=3
                                     )
pprint(lda_model_news.print_topics())
doc_lda_news=lda_model_news[corpus_news]


[(0,
  '0.009*"\'wage\'," + 0.008*"\'warren\'," + 0.006*"\'minimum\'," + '
  '0.005*"\'january\'," + 0.004*"\'elizabeth\'," + 0.004*"\'gun\'," + '
  '0.004*"\'cdata\'," + 0.003*"\'clinton\'," + 0.003*"\'percent\'," + '
  '0.003*"\'raise\',"'),
 (1,
  '0.013*"\'trump\'," + 0.007*"\'president\'," + 0.006*"\'republican\'," + '
  '0.006*"\'say\'," + 0.006*"\'obama\'," + 0.005*"\'gun\'," + '
  '0.005*"\'election\'," + 0.005*"\'right\'," + 0.005*"\'vote\'," + '
  '0.005*"\'clinton\',"'),
 (2,
  '0.007*"\'state\'," + 0.006*"\'people\'," + 0.006*"\'say\'," + '
  '0.005*"\'year\'," + 0.004*"\'image\'," + 0.004*"\'law\'," + '
  '0.004*"\'time\'," + 0.003*"\'police\'," + 0.003*"\'right\'," + '
  '0.003*"\'go\',"'),
 (3,
  '0.051*"\'trump\'," + 0.010*"\'donald\'," + 0.009*"\'say\'," + '
  '0.009*"\'president\'," + 0.005*"\'go\'," + 0.005*"\'like\'," + '
  '0.005*"\'image\'," + 0.005*"\'people\'," + 0.005*"\'know\'," + '
  '0.005*"\'obama\',"'),
 (4,
  '0.012*"\'people\'," + 0.011*"\'trump\'," + 0.

## For the News category, the most prominent topic words are 'trump' - 'wage' - 'warren' - 'donald' - 'people'
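The topic words above are printed with embedded quotes and commas (for example `'wage',`), which suggests the documents were split from the string form of the token lists. The small sketch below, on two made-up cells, shows where those characters come from and how parsing the strings first yields clean tokens.

```python
import ast

# How the columns look after reloading from CSV: strings, not lists
tokenized_title = "['trump', 'call', 'recount']"
tokenized_text = "['west', 'palm', 'beach']"

# Splitting the concatenated strings keeps brackets, quotes and commas
naive_tokens = (tokenized_title + " " + tokenized_text).split()

# Parsing each string back into a list first yields clean tokens
clean_tokens = ast.literal_eval(tokenized_title) + ast.literal_eval(tokenized_text)

print(naive_tokens)
print(clean_tokens)
```

Clean tokens also keep the dictionary smaller, since `'trump',` and `'trump']` would otherwise count as two different words.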

In [None]:
# analyze the politics category to look for its topics

df_test_politics=df_filtered[df_filtered['subject'] == 'politics']
documents_politics=df_test_politics['tokenized_title']+ " " + df_test_politics['tokenized_text']
documents_politics=(list(documents_politics))

In [None]:
# build the dictionary for this category


tokenized_doc_politics = [doc.split() for doc in documents_politics]

id2word=corpora.Dictionary(tokenized_doc_politics)

corpus_politics=[id2word.doc2bow(text)for text in tokenized_doc_politics]

num_topics=5

lda_model_politics=gensim.models.LdaMulticore(corpus=corpus_politics,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=3
                                     )
pprint(lda_model_politics.print_topics())
doc_lda_politics=lda_model_politics[corpus_politics]

[(0,
  '0.018*"\'clinton\'," + 0.009*"\'state\'," + 0.008*"\'hillary\'," + '
  '0.005*"\'million\'," + 0.005*"\'say\'," + 0.005*"\'new\'," + '
  '0.004*"\'email\'," + 0.004*"\'department\'," + 0.004*"\'campaign\'," + '
  '0.004*"\'report\',"'),
 (1,
  '0.016*"\'trump\'," + 0.008*"\'president\'," + 0.008*"\'say\'," + '
  '0.005*"\'vote\'," + 0.004*"\'year\'," + 0.004*"\'people\'," + '
  '0.004*"\'like\'," + 0.003*"\'woman\'," + 0.003*"\'donald\'," + '
  '0.003*"\'know\',"'),
 (2,
  '0.007*"\'say\'," + 0.006*"\'police\'," + 0.006*"\'student\'," + '
  '0.006*"\'people\'," + 0.004*"\'white\'," + 0.004*"\'black\'," + '
  '0.004*"\'man\'," + 0.004*"\'trump\'," + 0.004*"\'like\'," + '
  '0.004*"\'muslim\',"'),
 (3,
  '0.012*"\'say\'," + 0.011*"\'obama\'," + 0.007*"\'president\'," + '
  '0.005*"\'u.s\'," + 0.005*"\'year\'," + 0.005*"\'trump\'," + '
  '0.004*"\'people\'," + 0.004*"\'country\'," + 0.004*"\'state\'," + '
  '0.004*"\'united\',"'),
 (4,
  '0.026*"\'trump\'," + 0.010*"\'say\'," + 0.

## For the politics category, the most prominent topic words are 'clinton' - 'trump' - 'say' - 'obama' - 'hillary'

In [None]:
df_test_left_news=df_filtered[df_filtered['subject'] == 'left-news']
documents_left_news=df_test_left_news['tokenized_title']+ " " + df_test_left_news['tokenized_text']
documents_left_news=(list(documents_left_news))

In [None]:
# build the dictionary for this category - the number of passes was increased, as suggested in Colab, to improve accuracy


tokenized_doc_left_news = [doc.split() for doc in documents_left_news]

id2word=corpora.Dictionary(tokenized_doc_left_news)

corpus_left_news=[id2word.doc2bow(text)for text in tokenized_doc_left_news]

num_topics=5

lda_model_left_news=gensim.models.LdaMulticore(corpus=corpus_left_news,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=5
                                     )
pprint(lda_model_left_news.print_topics())
doc_lda_left_news=lda_model_left_news[corpus_left_news]

[(0,
  '0.009*"\'say\'," + 0.006*"\'muslim\'," + 0.005*"\'people\'," + '
  '0.004*"\'year\'," + 0.004*"\'state\'," + 0.004*"\'obama\'," + '
  '0.003*"\'white\'," + 0.003*"\'tell\'," + 0.003*"\'group\'," + '
  '0.003*"\'time\',"'),
 (1,
  '0.021*"\'trump\'," + 0.008*"\'say\'," + 0.007*"\'president\'," + '
  '0.005*"\'people\'," + 0.004*"\'police\'," + 0.004*"\'like\'," + '
  '0.004*"\'obama\'," + 0.003*"\'donald\'," + 0.003*"\'news\'," + '
  '0.003*"\'want\',"'),
 (2,
  '0.015*"\'trump\'," + 0.010*"\'say\'," + 0.008*"\'hillary\'," + '
  '0.008*"\'president\'," + 0.008*"\'clinton\'," + 0.007*"\'obama\'," + '
  '0.006*"\'vote\'," + 0.005*"\'state\'," + 0.005*"\'election\'," + '
  '0.004*"\'go\',"'),
 (3,
  '0.010*"\'black\'," + 0.008*"\'police\'," + 0.007*"\'clinton\'," + '
  '0.007*"\'say\'," + 0.005*"\'people\'," + 0.005*"\'year\'," + '
  '0.005*"\'hillary\'," + 0.005*"\'woman\'," + 0.004*"\'student\'," + '
  '0.004*"\'officer\',"'),
 (4,
  '0.009*"\'say\'," + 0.005*"\'year\'," + 0.005*

## For the left-news category, the most prominent topic words are 'trump' - 'black' - 'say' - 'police' - 'clinton'

In [None]:
df_test_Government=df_filtered[df_filtered['subject'] == 'Government News']
documents_Government=df_test_Government['tokenized_title']+ " " + df_test_Government['tokenized_text']
documents_Government=(list(documents_Government))

In [None]:
# build the dictionary for this category - the number of passes was increased, as suggested in Colab, to improve accuracy


tokenized_doc_Government = [doc.split() for doc in documents_Government]

id2word=corpora.Dictionary(tokenized_doc_Government)

corpus_Government=[id2word.doc2bow(text)for text in tokenized_doc_Government]

num_topics=5

lda_model_Government=gensim.models.LdaMulticore(corpus=corpus_Government,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=15
                                     )
pprint(lda_model_Government.print_topics())
doc_lda_Government=lda_model_Government[corpus_Government]

[(0,
  '0.006*"\'say\'," + 0.004*"\'government\'," + 0.003*"\'state\'," + '
  '0.003*"\'case\'," + 0.003*"\'people\'," + 0.003*"\'police\'," + '
  '0.003*"\'climate\'," + 0.003*"\'year\'," + 0.003*"\'court\'," + '
  '0.003*"\'change\',"'),
 (1,
  '0.009*"\'say\'," + 0.008*"\'trump\'," + 0.005*"\'president\'," + '
  '0.004*"\'obama\'," + 0.004*"\'people\'," + 0.004*"\'state\'," + '
  '0.004*"\'go\'," + 0.004*"\'right\'," + 0.003*"\'year\'," + '
  '0.003*"\'know\',"'),
 (2,
  '0.013*"\'clinton\'," + 0.010*"\'say\'," + 0.007*"\'state\'," + '
  '0.006*"\'email\'," + 0.006*"\'department\'," + 0.006*"\'hillary\'," + '
  '0.005*"\'president\'," + 0.005*"\'fbi\'," + 0.004*"\'obama\'," + '
  '0.004*"\'year\',"'),
 (3,
  '0.009*"\'say\'," + 0.007*"\'obama\'," + 0.006*"\'iran\'," + '
  '0.005*"\'u.s\'," + 0.004*"\'year\'," + 0.004*"\'muslim\'," + '
  '0.004*"\'state\'," + 0.004*"\'united\'," + 0.004*"\'islamic\'," + '
  '0.003*"\'people\',"'),
 (4,
  '0.010*"\'obama\'," + 0.007*"\'say\'," + 0.005

## For the Government News category, the most prominent topic words are 'clinton' - 'say' - 'trump' - 'obama' - 'iran'

In [None]:
df_test_US_News=df_filtered[df_filtered['subject'] == 'US_News']
documents_US_News=df_test_US_News['tokenized_title']+ " " + df_test_US_News['tokenized_text']
documents_US_News=(list(documents_US_News))

In [None]:
# build the dictionary for this category - the number of passes was increased, as suggested in Colab


tokenized_doc_US_News = [doc.split() for doc in documents_US_News]

id2word=corpora.Dictionary(tokenized_doc_US_News)

corpus_US_News=[id2word.doc2bow(text)for text in tokenized_doc_US_News]

num_topics=5

lda_model_US_News=gensim.models.LdaMulticore(corpus=corpus_US_News,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=15
                                     )
pprint(lda_model_US_News.print_topics())
doc_lda_US_News=lda_model_US_News[corpus_US_News]

[(0,
  '0.012*"\'trump\'," + 0.008*"\'clinton\'," + 0.007*"\'news\'," + '
  '0.006*"\'medium\'," + 0.006*"\'say\'," + 0.004*"\'century\'," + '
  '0.004*"\'new\'," + 0.004*"\'state\'," + 0.004*"\'president\'," + '
  '0.004*"\'hillary\',"'),
 (1,
  '0.011*"\'room\'," + 0.009*"\'acr\'," + 0.009*"\'boiler\'," + '
  '0.008*"\'medium\'," + 0.007*"\'radio\'," + 0.006*"\'story\'," + '
  '0.006*"\'broadcast\'," + 0.005*"\'political\'," + 0.005*"\'report\'," + '
  '0.004*"\'shooting\',"'),
 (2,
  '0.010*"\'trump\'," + 0.005*"\'russia\'," + 0.005*"\'say\'," + '
  '0.005*"\'wire\'," + 0.005*"\'news\'," + 0.005*"\'medium\'," + '
  '0.005*"\'russian\'," + 0.004*"\'cia\'," + 0.004*"\'intelligence\'," + '
  '0.004*"\'election\',"'),
 (3,
  '0.005*"\'medium\'," + 0.004*"\'fbi\'," + 0.004*"\'say\'," + '
  '0.003*"\'new\'," + 0.003*"\'news\'," + 0.003*"\'time\'," + '
  '0.003*"\'case\'," + 0.003*"\'report\'," + 0.003*"\'state\'," + '
  '0.003*"\'event\',"'),
 (4,
  '0.012*"\'syria\'," + 0.006*"\'state\',

## For the US_News category, the most prominent topic words are 'trump' - 'cia' - 'syria' - 'clinton' - 'acr'

In [None]:
df_test_Middle_east=df_filtered[df_filtered['subject'] == 'Middle-east']
documents_Middle_east=df_test_Middle_east['tokenized_title']+ " " + df_test_Middle_east['tokenized_text']
documents_Middle_east=(list(documents_Middle_east))

In [None]:
# build the dictionary for this category - the number of passes was increased, as suggested in Colab


tokenized_doc_Middle_east = [doc.split() for doc in documents_Middle_east]

id2word=corpora.Dictionary(tokenized_doc_Middle_east)

corpus_Middle_east=[id2word.doc2bow(text)for text in tokenized_doc_Middle_east]

num_topics=5

lda_model_Middle_east=gensim.models.LdaMulticore(corpus=corpus_Middle_east,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=15
                                     )
pprint(lda_model_Middle_east.print_topics())
doc_lda_Middle_east=lda_model_Middle_east[corpus_Middle_east]

[(0,
  '0.003*"\'state\'," + 0.003*"\'time\'," + 0.003*"\'say\'," + '
  '0.003*"\'ramsey\'," + 0.003*"\'world\'," + 0.003*"\'new\'," + '
  '0.003*"\'people\'," + 0.003*"\'century\'," + 0.002*"\'wire\'," + '
  '0.002*"\'news\',"'),
 (1,
  '0.005*"\'wire\'," + 0.005*"\'new\'," + 0.005*"\'story\'," + '
  '0.004*"\'say\'," + 0.004*"\'cia\'," + 0.004*"\'news\'," + '
  '0.004*"\'american\'," + 0.004*"\'year\'," + 0.004*"\'century\'," + '
  '0.004*"\'fbi\',"'),
 (2,
  '0.010*"\'news\'," + 0.008*"\'trump\'," + 0.008*"\'medium\'," + '
  '0.006*"\'russian\'," + 0.006*"\'say\'," + 0.006*"\'russia\'," + '
  '0.005*"\'fake\'," + 0.005*"\'report\'," + 0.005*"\'post\'," + '
  '0.004*"\'story\',"'),
 (3,
  '0.018*"\'trump\'," + 0.006*"\'political\'," + 0.006*"\'president\'," + '
  '0.005*"\'clinton\'," + 0.005*"\'room\'," + 0.005*"\'medium\'," + '
  '0.005*"\'say\'," + 0.004*"\'century\'," + 0.004*"\'election\'," + '
  '0.004*"\'acr\',"'),
 (4,
  '0.006*"\'syria\'," + 0.006*"\'state\'," + 0.005*"\'med

## For the Middle-east category, the most prominent topic words are 'trump' - 'news' - 'syria' - 'medium' - 'state'
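The cells above repeat the same steps once per category. As a sketch of how that per-category loop could be factored out, the helper below (`top_words_by_subject`, a hypothetical name) groups the articles by subject in a single pass. Note that it uses plain token counts instead of LDA: it is a simpler frequency baseline, useful as a sanity check on the LDA top words, not a replacement for the topic model. The sample articles are made up.

```python
from collections import Counter

# Hypothetical miniature: (subject, tokens) pairs for fake articles
fake_articles = [
    ("News", ["trump", "president", "say"]),
    ("News", ["trump", "wage", "raise"]),
    ("politics", ["clinton", "email", "say"]),
    ("politics", ["clinton", "hillary", "campaign"]),
]


def top_words_by_subject(articles, topn=3):
    """Return the topn most frequent tokens for every subject."""
    counters = {}
    for subject, tokens in articles:
        # One Counter per subject, created on first use
        counters.setdefault(subject, Counter()).update(tokens)
    return {
        subject: counter.most_common(topn)
        for subject, counter in counters.items()
    }


summary = top_words_by_subject(fake_articles, topn=1)
print(summary)
```

The same loop structure would apply to LDA: build the dictionary, corpus and model inside the loop body and store each fitted model in a dictionary keyed by subject.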

## Question 3: do the titles of fake news show any patterns?

In [None]:
# load the backup file

import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

file_path = '/content/drive/My Drive/Colab Notebooks/nlp-cleaned.csv'
df_cleaned = pd.read_csv(file_path)

Mounted at /content/drive


In [None]:
# filter the main dataset, keeping only the fake news

df_filtered = df_cleaned[df_cleaned['flag'] == 1]
df_filtered

Unnamed: 0.1,Unnamed: 0,title,text,subject,date,flag,id,tokenized_title,tokenized_text
1,7539,"Secret Service Agents Jump On Stage, Surround...",Donald Trump was briefly surrounded by Secret ...,News,"March 12, 2016",1,28956,"['secret', 'service', 'agent', 'jump', 'stage'...","['donald', 'trump', 'briefly', 'surround', 'se..."
4,8711,Tim Allen Cracks A Joke About Obama – Would B...,Self-proclaimed fiscal-conservative Tim Alle...,News,"January 17, 2016",1,30128,"['tim', 'allen', 'crack', 'joke', 'obama', 'de...","['self', 'proclaim', 'fiscal', 'conservative',..."
8,1559,Sally Yates Just Opened A Can Of Constitution...,During testimony in front of the Senate Intell...,News,"May 8, 2017",1,22976,"['sally', 'yate', 'open', 'constitutional', 'w...","['testimony', 'senate', 'intelligence', 'commi..."
10,11303,NEWT GINGRICH: If this had been one of Trump’s...,,politics,"Mar 25, 2017",1,32720,"['newt', 'gingrich', 'trump', 'hotel', 'there’...",[]
11,11915,LT GEN MCINERNEY’S Take On Trump Dossier And C...,,politics,"Jan 13, 2017",1,33332,"['gen', 'mcinerney', 'trump', 'dossier', 'clin...",[]
...,...,...,...,...,...,...,...,...,...
44889,7211,Watch A CNN Anchor Put Trump In His Place On ...,Republican presidential frontrunner Donald Tru...,News,"March 30, 2016",1,28628,"['watch', 'cnn', 'anchor', 'trump', 'place', '...","['republican', 'presidential', 'frontrunner', ..."
44890,19917,LOL! Leftist CA Congresswoman On Tonight’s Deb...,The Democrats are in full panic-mode over Croo...,left-news,"Sep 26, 2016",1,41334,"['lol', 'leftist', 'congresswoman', 'tonight',...","['democrats', 'panic', 'mode', 'crooked', 'hil..."
44891,9907,Trump Visits Hurricane Irma Survivors…One Surv...,You have to love this! A Florida man greeted P...,politics,"Sep 14, 2017",1,31324,"['trump', 'visit', 'hurricane', 'irma', 'survi...","['love', 'florida', 'man', 'greet', 'president..."
44893,16417,HOW DID THE FBI NOT REPORT THIS? Devastating F...,Former NSA officer John Schindler reports on a...,Government News,"Sep 6, 2016",1,37834,"['fbi', 'report', 'devastating', 'fact', 'ment...","['nsa', 'officer', 'john', 'schindler', 'repor..."


In [None]:
# use TF-IDF to weight the words in the fake titles

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

corpus=df_filtered['tokenized_title']

fn_tfidf = tfidf_vectorizer.fit_transform(corpus)
len(tfidf_vectorizer.vocabulary_)

11563

In [None]:
# reduce the vocabulary size: min_df=0.0001 keeps only words found in at least 0.01% of the titles

tfidf_vectorizer = TfidfVectorizer(min_df = 0.0001)
fn_tfidf = tfidf_vectorizer.fit_transform(corpus)
len(tfidf_vectorizer.vocabulary_)

6511

In [None]:
fn_tfidf

<23481x6511 sparse matrix of type '<class 'numpy.float64'>'
	with 225875 stored elements in Compressed Sparse Row format>
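With 23,481 fake titles, `min_df=0.0001` corresponds to a threshold of roughly 2.3 documents, which in practice drops words seen in only one or two titles. The sketch below illustrates the effect of `min_df` on a toy corpus; the three titles are made up, and an integer threshold is used so the result is easy to follow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles (hypothetical)
titles = ["trump wins vote", "trump loses vote", "zoo opens"]

# Default settings: every word is kept
full_vocab = TfidfVectorizer().fit(titles).vocabulary_

# An integer min_df is an absolute document count;
# a float is a fraction of the number of documents
pruned_vocab = TfidfVectorizer(min_df=2).fit(titles).vocabulary_

print(len(full_vocab), len(pruned_vocab))
print(sorted(pruned_vocab))
```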

In [None]:
print(tfidf_vectorizer.vocabulary_)



In [None]:
type(tfidf_vectorizer.vocabulary_)

dict

In [None]:
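# vocabulary_ maps each word to its column index (assigned alphabetically), not to a count,
# so this lists the 50 words with the highest index, i.e. the alphabetically last ones

top_items = sorted(tfidf_vectorizer.vocabulary_.items(), key=lambda item: item[1], reverse=True)[:50]
top_df = pd.DataFrame(top_items, columns=['word', 'index'])
print("The 50 words with the highest vocabulary index in fake titles:")
print(top_df)


The 50 words with the highest vocabulary index in fake titles:
               word  index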
0             žižek  11562
1              état  11561
2        zuckerberg  11560
3            zucker  11559
4         zoolander  11558
5               zoo  11557
6            zoning  11556
7              zone  11555
8     zombiehillary  11554
9            zombie  11553
10              zit  11552
11          zionist  11551
12           zinger  11550
13             zing  11549
14        zimmerman  11548
15         zimbabwe  11547
16            zilch  11546
17             zika  11545
18             zero  11544
19           zephyr  11543
20         zelnicek  11542
21        zellweger  11541
22           zealot  11540
23          zealand  11539
24               ze  11538
25         zbigniew  11537
26             zari  11536
27          zarakia  11535
28             zaps  11534
29        zakharova  11533
30          zakaria  11532
31           yuuuge  11531
32            yuuge  11530
33            yulín  11529
34             

### It is hard to identify a precise pattern from this table: few of the listed words relate to politics and several seem to come from pop culture.
### That impression is not reliable, however, because the table does not rank words by usage.
### All the listed words begin with z (or a later, accented letter) because `vocabulary_` stores each word's column index, assigned in alphabetical order, so sorting by it in descending order simply returns the end of the alphabet.
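
The `vocabulary_` attribute maps each term to its column index, so it cannot be used to rank words by usage. One possible way to get a ranking, sketched here on a hypothetical toy corpus, is to sum each column of the TF-IDF matrix and sort the terms by that aggregate weight. Summing TF-IDF weights is an assumption about what "most used" should mean; raw counts from `CountVectorizer` would be an equally reasonable choice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical titles, not taken from the dataset)
toy_titles = [
    "trump tweet shock",
    "trump rally crowd",
    "obama speech crowd",
    "senate vote trump",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(toy_titles)

# Sorting vocabulary_ by value only orders terms by column index,
# i.e. alphabetically: the "top" term is simply the last one in the alphabet
by_index = sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1], reverse=True)
print(by_index[0])

# Sum each column to get one aggregate TF-IDF weight per term
weights = np.asarray(matrix.sum(axis=0)).ravel()

# Recover the terms in column order from the vocabulary mapping
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Rank terms by descending aggregate weight
order = np.argsort(weights)[::-1]
top_terms = [(terms[i], round(float(weights[i]), 3)) for i in order[:3]]
print(top_terms)
```

On the real data the same idea would apply to `fn_tfidf` and `tfidf_vectorizer` in place of the toy objects.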

## Once the model is trained, export it with pickle so that your colleagues can put it into production

In [None]:
# Train a model to detect fake news

# Recode the flag column to readable labels so the model's predictions are easier to interpret

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Work on a copy so that df_cleaned keeps its original 0/1 flags
data = df_cleaned.copy()
data['flag'] = data['flag'].replace({0: 'true', 1: 'fake'})
data



Unnamed: 0.1,Unnamed: 0,title,text,subject,date,flag,id,tokenized_title,tokenized_text
0,7048,Trump calls Green Party vote recount request a...,"WEST PALM BEACH, Fla. (Reuters) - U.S. Preside...",politicsNews,"November 26, 2016",true,7048,"['trump', 'call', 'green', 'party', 'vote', 'r...","['west', 'palm', 'beach', 'fla', 'reuters', 'u..."
1,7539,"Secret Service Agents Jump On Stage, Surround...",Donald Trump was briefly surrounded by Secret ...,News,"March 12, 2016",fake,28956,"['secret', 'service', 'agent', 'jump', 'stage'...","['donald', 'trump', 'briefly', 'surround', 'se..."
2,14067,"U.S. strikes on Taliban opium labs won't work,...","LASHKAR GAH, Afghanistan/KABUL (Reuters) - As ...",worldnews,"November 23, 2017",true,14067,"['u.s', 'strike', 'taliban', 'opium', 'lab', '...","['lashkar', 'gah', 'afghanistan', 'kabul', 're..."
3,21373,EU citizens leaving UK pushes down net migrati...,LONDON (Reuters) - Net migration to Britain fe...,worldnews,"August 24, 2017",true,21373,"['citizen', 'leave', 'push', 'net', 'migration...","['london', 'reuters', 'net', 'migration', 'bri..."
4,8711,Tim Allen Cracks A Joke About Obama – Would B...,Self-proclaimed fiscal-conservative Tim Alle...,News,"January 17, 2016",fake,30128,"['tim', 'allen', 'crack', 'joke', 'obama', 'de...","['self', 'proclaim', 'fiscal', 'conservative',..."
...,...,...,...,...,...,...,...,...,...
44893,16417,HOW DID THE FBI NOT REPORT THIS? Devastating F...,Former NSA officer John Schindler reports on a...,Government News,"Sep 6, 2016",fake,37834,"['fbi', 'report', 'devastating', 'fact', 'ment...","['nsa', 'officer', 'john', 'schindler', 'repor..."
44894,3655,U.S. Supreme Court leaves key campaign finance...,WASHINGTON (Reuters) - The U.S. Supreme Court ...,politicsNews,"May 22, 2017",true,3655,"['u.s', 'supreme', 'court', 'leave', 'key', 'c...","['washington', 'reuters', 'u.s', 'supreme', 'c..."
44895,21021,Voice of triumph or doom: North Korean present...,SEOUL (Reuters) - Wearing a pink Korean dress ...,worldnews,"September 4, 2017",true,21021,"['voice', 'triumph', 'doom', 'north', 'korean'...","['seoul', 'reuters', 'wear', 'pink', 'korean',..."
44896,12542,BREAKING: HILLARY’S CAMPAIGN CHAIRMAN On Close...,It s good to know Hillary s Campaign Chairman ...,politics,"Nov 1, 2016",fake,33959,"['break', 'hillary', 'campaign', 'chairman', '...","['good', 'know', 'hillary', 'campaign', 'chair..."


In [None]:
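# Build the working sample: train_test_split returns (train, test) pairs, so text and flag
# hold the 10% split (4,490 articles); the remaining 90% (c, c1) is left unused
c, text, c1, flag = train_test_split(data['text'], data['flag'], test_size=0.10, random_state=42)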

In [None]:
len(text)

4490

In [None]:
len(flag)

4490
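
The order in which `train_test_split` returns its pieces is easy to mix up, so here is a small sketch on hypothetical data. With two input sequences it returns the training and test parts of the first, then the training and test parts of the second, which is why the second and fourth values above contain 10% of the rows.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the text and flag columns
toy_texts = [f"article {i}" for i in range(100)]
toy_flags = ["fake" if i % 2 else "true" for i in range(100)]

# Return order: first input (train, test), then second input (train, test)
texts_train, texts_test, flags_train, flags_test = train_test_split(
    toy_texts, toy_flags, test_size=0.10, random_state=42
)

print(len(texts_train), len(texts_test))
```

Passing `stratify=toy_flags` would additionally keep the fake/true proportions identical in both parts.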

In [None]:
# Of the 4,490 sampled articles, 2,339 are fake, i.e. about 52%

len(flag[flag=='fake'])

2339

In [None]:
# Of the 4,490 sampled articles, 2,151 are true, i.e. about 48%

len(flag[flag=='true'])

2151
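
The two counts above can also be obtained in a single call. The sketch below rebuilds a Series with the same class counts reported in the outputs (the Series itself is a stand-in, not the real `flag` column) and asks pandas for the shares directly.

```python
import pandas as pd

# Stand-in for the flag column, using the counts reported above
toy_flag = pd.Series(["fake"] * 2339 + ["true"] * 2151)

# normalize=True returns proportions instead of raw counts
shares = toy_flag.value_counts(normalize=True).round(2)
print(shares)
```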

In [None]:
# Use the preprocessing function seen during the course to clean the raw text

import string
import spacy
from nltk.corpus import stopwords
import re

# A set makes the stop-word membership test faster than a list
english_stopwords = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_sm')

def data_cleaner(sentence):
    # Lowercase and replace punctuation with spaces
    sentence = sentence.lower()
    for c in string.punctuation:
        sentence = sentence.replace(c, " ")
    # Lemmatize with spaCy
    document = nlp(sentence)
    sentence = ' '.join(token.lemma_ for token in document)
    # Drop English stop words
    sentence = ' '.join(word for word in sentence.split() if word not in english_stopwords)
    # Remove digits (the raw string avoids an invalid-escape warning)
    sentence = re.sub(r'\d', '', sentence)

    return sentence
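
To make the individual cleaning steps easier to follow, here is a simplified sketch that needs neither spaCy nor NLTK. It is not the function above: lemmatization is omitted and `TOY_STOPWORDS` is a tiny hypothetical stop-word list. It also shows that removing digits as the last step leaves a double space where a number used to be, which can be collapsed with a final `split`/`join`.

```python
import re
import string

# Tiny hypothetical stop-word list, standing in for NLTK's English list
TOY_STOPWORDS = {"the", "on", "a", "of", "to"}

def simple_cleaner(sentence):
    # Lowercase and replace punctuation with spaces
    sentence = sentence.lower()
    for c in string.punctuation:
        sentence = sentence.replace(c, " ")
    # Drop stop words (lemmatization is deliberately left out of this sketch)
    sentence = " ".join(w for w in sentence.split() if w not in TOY_STOPWORDS)
    # Remove digits; this leaves a gap where a standalone number was
    return re.sub(r"\d", "", sentence)

cleaned = simple_cleaner("Stock market loses 350 points, ABC News reports!")
print(repr(cleaned))

# Optional extra step: collapse the leftover whitespace
collapsed = " ".join(cleaned.split())
print(repr(collapsed))
```

`CountVectorizer` splits on whitespace runs anyway, so the double space does not affect the features built later.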

In [None]:
text_cleaned = []
for r in text:
    text_cleaned.append(data_cleaner(r))

In [None]:
text_cleaned

['stock market lose  point abc news erroneously report general flynn communication russian ambassador sergey kislyak trump campaign turn stock market plunge feeding frenzy leftist medium abc news get wrong flynn actually contact russian ambassador trump transition period entirely different story fact accord video uncover citizen journalist jack posobiec obama state department tell reporter trump transition period state department problem transition team meet foreign official see video accord cnn correspondent jim acosta obama regime actually give go ahead flynn conversation russian ambassador friday white house say obama administration authorize former national security adviser michael flynn contact russian ambassador sergey kislyak president trump transition accord cnn flynn plead guilty friday lie fbi contact kislyak month trump take office first current former trump white house official bring special counsel robert mueller investigation russian election meddling court record indicat

In [None]:
# Split the cleaned sample into training (80%) and test (20%) sets

x, x_test, y, y_test = train_test_split(text_cleaned, flag, test_size=0.20, random_state=42)

In [None]:
# Use CountVectorizer to turn the words into numeric count features

vec = CountVectorizer()
x = vec.fit_transform(x).toarray()
x_test = vec.transform(x_test).toarray()

In [None]:
# Then train the model

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(x, y)

In [None]:
# Check the model's accuracy on the test set

model.score(x_test, y_test)

0.9521158129175946
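
Accuracy alone does not show which class the errors fall on. A confusion matrix makes that visible; the sketch below uses made-up labels and predictions purely to show how the matrix is read, and is not derived from the model above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only
toy_true = ["fake", "fake", "fake", "true", "true", "true"]
toy_pred = ["fake", "fake", "true", "true", "true", "fake"]

# Rows are the actual classes, columns the predicted ones, in the order given by labels
cm = confusion_matrix(toy_true, toy_pred, labels=["fake", "true"])
print(cm)
```

With the real model the same call would take `y_test` and `model.predict(x_test)`.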

In [None]:
# Test a sentence from the true dataset - FIRST TEST

sentence="The Kremlin said on Tuesday that possible supplies of lethal weapons "

sentence_cleaned=data_cleaner(sentence)

sentence_cleaned

sentence_cv=vec.transform([sentence_cleaned])
model.predict(sentence_cv)

array(['true'], dtype='<U4')

In [None]:
# The model rates the news as true with about 99% probability
# (the columns follow model.classes_, i.e. 'fake' then 'true')

model.predict_proba(sentence_cv)

array([[0.00904975, 0.99095025]])
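
The columns of `predict_proba` follow the order of `model.classes_`, which scikit-learn sorts, so with string labels `'fake'` comes before `'true'`. The sketch below trains a toy model on hypothetical sentences and pairs each probability with its class name, which avoids having to remember the column order.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training sentences, for illustration only
toy_texts = [
    "shocking secret exposed",
    "you will not believe this shocking video",
    "senate passes budget bill",
    "court rules on budget dispute",
]
toy_labels = ["fake", "fake", "true", "true"]

toy_vec = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vec.fit_transform(toy_texts), toy_labels)

# One row of probabilities, one column per class
proba = toy_model.predict_proba(toy_vec.transform(["shocking video"]))[0]

# Pair each probability with its class name
by_class = dict(zip(toy_model.classes_, proba))
print(by_class)
```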

In [None]:
# Test a made-up sentence that is plausibly false - SECOND TEST

sentence="Obama is a unicorn, loves to carry children on his back"

sentence_cleaned=data_cleaner(sentence)

sentence_cleaned

sentence_cv=vec.transform([sentence_cleaned])
model.predict(sentence_cv)

array(['fake'], dtype='<U4')

In [None]:
# The model rates the news as fake with about 97% probability

model.predict_proba(sentence_cv)

array([[0.97212926, 0.02787074]])

In [None]:
# Test a sentence from the fake dataset - THIRD TEST

sentence="A new animatronic figure in the Hall of Presidents at Walt Disney World was added"

sentence_cleaned=data_cleaner(sentence)

sentence_cleaned

sentence_cv=vec.transform([sentence_cleaned])
model.predict(sentence_cv)

array(['fake'], dtype='<U4')

In [None]:
# The model rates the news as fake with about 76% probability

model.predict_proba(sentence_cv)

array([[0.76057302, 0.23942698]])

In [None]:
# Test a made-up sentence that is plausibly true - FOURTH TEST

sentence="Obama won a Noble prize"

sentence_cleaned=data_cleaner(sentence)

sentence_cleaned

sentence_cv=vec.transform([sentence_cleaned])
model.predict(sentence_cv)

array(['fake'], dtype='<U4')

In [None]:
# The model rates the news as fake with about 60% probability. My guess is that in most
# of the articles in my sample the word Obama is associated with fake news

model.predict_proba(sentence_cv)

array([[0.60260096, 0.39739904]])

## I trained a naive Bayes classifier. It could certainly be improved by enlarging the training set or, better still, by using cross-validation to tune and evaluate it more reliably
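
A minimal sketch of what cross-validation could look like here, on hypothetical toy sentences rather than the real data. Wrapping the vectorizer and the classifier in one pipeline is an assumption of this sketch, not something the notebook does; it ensures the vocabulary is refitted on the training part of each fold only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned sentences and labels, for illustration only
toy_texts = [
    "shocking secret exposed video",
    "unbelievable shocking video leak",
    "secret leak exposed scandal",
    "senate passes budget bill",
    "court rules budget dispute",
    "senate court bill vote",
]
toy_labels = ["fake", "fake", "fake", "true", "true", "true"]

# Vectorizer and classifier are fitted together inside every fold
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# 3 stratified folds: each one holds out one fake and one true sentence
scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=3)
print(scores.mean())
```

On the real data, `text_cleaned` and `flag` would replace the toy lists, and a larger `cv` such as 5 would be a common choice.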


# Export the model with the pickle library



In [None]:
import pickle

with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)

## Import test

In [None]:
with open('model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)


X_test = "Trump and Obama both strongly belive in Scientology"
X_test=data_cleaner(X_test)

X_test_count= vec.transform([X_test])

predictions = loaded_model.predict(X_test_count)

print(predictions)

['fake']


## The model reloaded from pickle classified the news as fake
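
The import test above still relies on the `vec` object held in memory, which would not exist in a fresh production process because only the classifier was pickled. One possible remedy, sketched here on hypothetical toy data, is to pickle the vectorizer and the classifier together as a single pipeline. The pipeline is an assumption of this sketch; the notebook itself exports only `model`.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned sentences and labels, for illustration only
toy_texts = [
    "shocking secret exposed video",
    "unbelievable shocking video leak",
    "senate passes budget bill",
    "court rules budget dispute",
]
toy_labels = ["fake", "fake", "true", "true"]

# Fit vectorizer and classifier as one object
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(toy_texts, toy_labels)

# Serialize and restore in memory; pickle.dump/load on a file works the same way
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored pipeline accepts cleaned text directly, no separate vectorizer needed
sample = ["shocking secret video"]
print(restored.predict(sample))
```

The cleaning step (`data_cleaner`) is ordinary Python code and would still have to be shipped alongside the pickle file.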