# Fake News Filter
The fake news problem has grown exponentially over the last decade with the spread of social networks. The United States government has decided to act and has commissioned your company to build a Chrome plug-in that can recognize whether a news item is false. Your task is to build the model that recognizes fake news; the team of machine learning engineers and web developers will then put it into production. You are given two collections of news, one containing only fake news and the other containing only real news: use them to train your model.

### [Link to the dataset](https://proai-datasets.s3.eu-west-3.amazonaws.com/fake_news.zip)

Start with a careful analysis, answering questions such as:
- Is fake news more frequent in a particular category?
- Within each category, are some topics more prone to fake news?
- Do the titles of fake news show patterns?

Once the model is trained, export it [using pickle](https://scikit-learn.org/stable/model_persistence.html) so that your colleagues can put it into production.

## Importing the datasets

In [None]:
!wget https://proai-datasets.s3.eu-west-3.amazonaws.com/fake_news.zip
!unzip fake_news.zip

--2024-03-24 15:36:20--  https://proai-datasets.s3.eu-west-3.amazonaws.com/fake_news.zip
Resolving proai-datasets.s3.eu-west-3.amazonaws.com (proai-datasets.s3.eu-west-3.amazonaws.com)... 52.95.154.82, 3.5.224.142
Connecting to proai-datasets.s3.eu-west-3.amazonaws.com (proai-datasets.s3.eu-west-3.amazonaws.com)|52.95.154.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42975911 (41M) [application/zip]
Saving to: ‘fake_news.zip’


2024-03-24 15:36:23 (21.7 MB/s) - ‘fake_news.zip’ saved [42975911/42975911]

Archive:  fake_news.zip
  inflating: Fake.csv                
  inflating: True.csv                


In [None]:
import pandas as pd

In [None]:
df_true = pd.read_csv("True.csv")
df_true.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB


In [None]:
df_fake = pd.read_csv("Fake.csv")
df_fake.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB


# Training a model to recognize fake news

## Merging the two datasets

In [None]:
df_true['source'] = 'true'
df_fake['source'] = 'fake'
df_true_subset = df_true[['title', 'text', 'source']]
df_fake_subset = df_fake[['title', 'text', 'source']]
df_news = pd.concat([df_true_subset, df_fake_subset], ignore_index=True)
df_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   44898 non-null  object
 1   text    44898 non-null  object
 2   source  44898 non-null  object
dtypes: object(3)
memory usage: 1.0+ MB


## Defining the data preprocessing functions (cleaning and vectorization)

In [None]:
# data_cleaner: text normalization followed by a spaCy part-of-speech filter
import nltk
from nltk.corpus import stopwords
import re
import spacy
import string
nltk.download('stopwords')
english_stopwords = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_sm')
punctuation = set(string.punctuation)

def data_cleaner(dataset):

    def filter_tokens(text):
        # Drop pronouns, verbs, adverbs, auxiliaries and adpositions
        doc = nlp(text)
        filtered_tokens = [token.text for token in doc if token.pos_ not in ['PRON', 'VERB', 'ADV', 'AUX', 'ADP']]
        return ' '.join(filtered_tokens)

    dataset_to_return = []
    for sentence in dataset:
        sentence = sentence.lower()
        # Remove punctuation, stopwords, digits and words of three characters or fewer
        sentence = ''.join([char for char in sentence if char not in punctuation])
        sentence = ' '.join(word for word in sentence.split() if word not in english_stopwords)
        sentence = re.sub(r'\d', '', sentence)
        sentence = ' '.join(word for word in sentence.split() if len(word) > 3)
        sentence = filter_tokens(sentence)
        dataset_to_return.append(sentence)

    return dataset_to_return

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
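
To see what the string-level steps of `data_cleaner` do to a single sentence, here is a minimal sketch that reproduces them without the spaCy part-of-speech filter. The stopword set `TOY_STOPWORDS`, the function name `basic_clean` and the example sentence are made up for illustration; the notebook itself uses NLTK's full English stopword list.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set (the notebook uses NLTK's English list)
TOY_STOPWORDS = {"the", "and", "that", "with", "this", "from"}

def basic_clean(sentence, stopwords=TOY_STOPWORDS):
    """Apply the same string-level steps as data_cleaner, in the same order."""
    sentence = sentence.lower()
    # Remove punctuation characters
    sentence = "".join(ch for ch in sentence if ch not in string.punctuation)
    # Remove stopwords
    sentence = " ".join(w for w in sentence.split() if w not in stopwords)
    # Remove digits
    sentence = re.sub(r"\d", "", sentence)
    # Keep only words longer than three characters
    return " ".join(w for w in sentence.split() if len(w) > 3)

example = "The Senate passed 3 bills, and the President signed them!"
cleaned = basic_clean(example)
print(cleaned)  # senate passed bills president signed them
```

With the spaCy filter applied afterwards, verbs such as "passed" and "signed" would most likely be dropped as well, depending on how the tagger labels the lowercased text.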


In [None]:
# TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer=TfidfVectorizer()
def bow_tfidf(dataset, vectorizer):
  # With vectorizer=None a new TfidfVectorizer is fitted; otherwise the given one is reused
  if vectorizer is None:
    vectorizer=TfidfVectorizer()
    X=vectorizer.fit_transform(dataset)
  else:
    X=vectorizer.transform(dataset)
  return X.toarray(),vectorizer
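
`bow_tfidf` calls `.toarray()`, which turns the sparse TF-IDF matrix into a dense one. With thousands of documents and a large vocabulary, the dense copy can be much bigger than the sparse original. The sketch below uses an invented three-document corpus to show the sparse matrix that `TfidfVectorizer` returns; scikit-learn estimators such as `MLPClassifier` accept it directly.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus, for illustration only
docs = [
    "senate passes budget bill",
    "president signs budget",
    "video shows president",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # SciPy sparse matrix: only non-zero weights are stored

print("sparse:", sparse.issparse(X), "shape:", X.shape, "non-zeros:", X.nnz)

# Dense copy, as produced by bow_tfidf; it stores every zero explicitly
dense = X.toarray()
print("dense cells:", dense.size)
```

Whether skipping the dense conversion is worthwhile here depends on the vocabulary size, which the notebook does not print.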

## Cleaning the dataset, then selecting the training subset and vectorizing it

In [None]:
news_text_cleaned=data_cleaner(df_news['text']) # takes more than 15 minutes

In [None]:
# save to file:
import pickle

with open('news_text_cleaned.pkl', 'wb') as f:
    pickle.dump(news_text_cleaned, f)

In [None]:
# load from file:
import pickle

with open('news_text_cleaned.pkl', 'rb') as f:
    news_text_cleaned = pickle.load(f)

In [None]:
df_news['text_cleaned']=news_text_cleaned

In [None]:
from sklearn.model_selection import train_test_split
half1_df,half2_df=train_test_split(df_news,test_size=0.50,random_state=11)
df_train,df_test=train_test_split(half1_df,test_size=0.33,random_state=11)
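
The two consecutive splits first keep half of the 44898 articles (`half2_df` is not used elsewhere in this notebook) and then reserve 33% of that half for testing. The resulting sizes can be checked by hand. The sketch below assumes the rounding rule scikit-learn applies to a fractional `test_size` (the test set is rounded up and the training set takes the remainder); `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test set size is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 44898  # rows in df_news
n_half1, n_half2 = split_sizes(n_total, 0.50)
n_train, n_test = split_sizes(n_half1, 0.33)

print("first split:", n_half1, n_half2)
print("train/test:", n_train, n_test)
```

The training size obtained this way agrees with the class counts printed further down (7900 fake + 7140 true).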

In [None]:
train_news_cleaned,vectorized=bow_tfidf(df_train['text_cleaned'], None)
train_news_cleaned

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
len(df_train[df_train['source']=='fake'])

7900

In [None]:
len(df_train[df_train['source']=='true'])

7140

## Creating and training the model

In [None]:
from sklearn.neural_network import MLPClassifier
clf= MLPClassifier(activation='logistic',
                   solver='adam',
                   max_iter=50,
                   hidden_layer_sizes=(5,),
                   tol=0.01,
                   verbose=True
                   )

In [None]:
clf.fit(train_news_cleaned,df_train['source'])

Iteration 1, loss = 0.68174977
Iteration 2, loss = 0.65558769
Iteration 3, loss = 0.62038984
Iteration 4, loss = 0.57611418
Iteration 5, loss = 0.52707486
Iteration 6, loss = 0.47791007
Iteration 7, loss = 0.43168790
Iteration 8, loss = 0.38999817
Iteration 9, loss = 0.35309893
Iteration 10, loss = 0.32069407
Iteration 11, loss = 0.29231265
Iteration 12, loss = 0.26735386
Iteration 13, loss = 0.24542014
Iteration 14, loss = 0.22593895
Iteration 15, loss = 0.20857697
Iteration 16, loss = 0.19306481
Iteration 17, loss = 0.17914302
Iteration 18, loss = 0.16661920
Iteration 19, loss = 0.15523532
Iteration 20, loss = 0.14495515
Iteration 21, loss = 0.13558741
Iteration 22, loss = 0.12703203
Iteration 23, loss = 0.11922818
Iteration 24, loss = 0.11206156
Iteration 25, loss = 0.10545837
Iteration 26, loss = 0.09938338
Iteration 27, loss = 0.09379462
Iteration 28, loss = 0.08863043
Iteration 29, loss = 0.08384579
Iteration 30, loss = 0.07940105
Iteration 31, loss = 0.07529581
Training loss did

In [None]:
import pickle

with open('filtro_fake_news.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [None]:
import pickle

with open('filtro_fake_news.pkl', 'rb') as f:
    clf = pickle.load(f)
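
The pickled `clf` expects TF-IDF vectors built with the vocabulary fitted on the training set, so whoever loads it also needs the fitted vectorizer. One possible way to ship both together is a scikit-learn `Pipeline`. The following is a minimal sketch on an invented toy corpus, not the notebook's actual training run; the step names `tfidf` and `mlp` are arbitrary.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Invented toy data, for illustration only
texts = [
    "senate passes budget bill",
    "president signs trade agreement",
    "court upholds federal ruling",
    "shocking video exposes secret plot",
    "you will not believe this image",
    "watch celebrity destroys politician",
]
labels = ["true", "true", "true", "fake", "fake", "fake"]

# Vectorizer and classifier bundled in a single object
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(5,), max_iter=200, random_state=0)),
])
pipeline.fit(texts, labels)

# Round trip through pickle; with a file, use open(..., 'wb') / open(..., 'rb') as above
payload = pickle.dumps(pipeline)
restored = pickle.loads(payload)

# The restored object takes raw (cleaned) text, with no separate vectorization step
prediction = restored.predict(["president signs budget bill"])
print(prediction)
```

The text passed to the pipeline would still need the same cleaning that `data_cleaner` applies at training time.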

## Evaluating the model

In [None]:
test_news_cleaned,vectorized=bow_tfidf(df_test['text_cleaned'], vectorized)
test_news_cleaned

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
clf.score(test_news_cleaned,df_test['source'])

0.9657173707652854
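
Accuracy alone does not say which kind of error the filter makes: flagging a real article as fake, or letting a fake one through. A minimal sketch of precision and recall for the `fake` class follows, computed by hand on invented labels; `per_class_report` is a hypothetical helper.

```python
from collections import Counter

def per_class_report(y_true, y_pred, positive="fake"):
    """Return (precision, recall) for the class treated as positive."""
    counts = Counter(zip(y_true, y_pred))
    tp = counts[(positive, positive)]
    fp = sum(n for (t, p), n in counts.items() if p == positive and t != positive)
    fn = sum(n for (t, p), n in counts.items() if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented labels and predictions, for illustration only
y_true = ["fake", "fake", "fake", "true", "true", "true", "true", "fake"]
y_pred = ["fake", "fake", "true", "true", "true", "fake", "true", "fake"]

precision, recall = per_class_report(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.75
```

In the notebook, the same figures should be obtainable with `sklearn.metrics.classification_report(df_test['source'], clf.predict(test_news_cleaned))`.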

In [None]:
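from sklearn.model_selection import cross_val_score

# Note: the vectorizer is fitted on the full dataset before the folds are split,
# so the IDF weights used in each fold also reflect its held-out documents
X_vectorized = vectorizer.fit_transform(df_news['text_cleaned'])
cross_val_scores = cross_val_score(clf, X_vectorized, df_news['source'], cv=3, scoring='accuracy')
# takes more than 15 minutes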

Iteration 1, loss = 0.67328794
Iteration 2, loss = 0.61915021
Iteration 3, loss = 0.54235464
Iteration 4, loss = 0.46245305
Iteration 5, loss = 0.39397913
Iteration 6, loss = 0.33903491
Iteration 7, loss = 0.29519766
Iteration 8, loss = 0.25986847
Iteration 9, loss = 0.23093062
Iteration 10, loss = 0.20685641
Iteration 11, loss = 0.18650328
Iteration 12, loss = 0.16912843
Iteration 13, loss = 0.15411526
Iteration 14, loss = 0.14104061
Iteration 15, loss = 0.12953180
Iteration 16, loss = 0.11936114
Iteration 17, loss = 0.11030837
Iteration 18, loss = 0.10219749
Iteration 19, loss = 0.09491935
Iteration 20, loss = 0.08835645
Iteration 21, loss = 0.08241698
Iteration 22, loss = 0.07700334
Iteration 23, loss = 0.07208588
Iteration 24, loss = 0.06758867
Iteration 25, loss = 0.06347056
Iteration 26, loss = 0.05970435
Iteration 27, loss = 0.05623257
Training loss did not improve more than tol=0.010000 for 10 consecutive epochs. Stopping.
Iteration 1, loss = 0.66708147
Iteration 2, loss = 0.58

In [None]:
with open('cross_val_scores.pkl', 'wb') as f:
    pickle.dump(cross_val_scores, f)

In [None]:
with open('cross_val_scores.pkl', 'rb') as f:
    cross_val_scores = pickle.load(f)

In [None]:
cross_val_scores

array([0.93538688, 0.91894962, 0.96418549])

In [None]:
cross_val_scores.mean()

0.9395073277206111
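
In the cross-validation above, the TF-IDF vectorizer is fitted once on all documents, so each fold's held-out texts contribute to the vocabulary and IDF weights. A possible alternative is to put the vectorizer and the classifier in one pipeline, so the vectorizer is refitted on the training part of every fold. The sketch below uses an invented toy corpus and is not a rerun of the notebook's experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Invented toy data: six "true" and six "fake" documents
texts = [
    "senate passes budget bill", "president signs trade agreement",
    "court upholds federal ruling", "minister announces new budget",
    "parliament votes on trade bill", "court reviews federal agreement",
    "shocking video exposes secret plot", "you will not believe this image",
    "watch celebrity destroys politician", "secret video shocks everyone",
    "unbelievable image exposes celebrity", "watch this shocking plot",
]
labels = ["true"] * 6 + ["fake"] * 6

# The vectorizer is refitted inside every fold, on that fold's training texts only
model = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(5,), max_iter=50, random_state=0),
)
scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")
print(scores, scores.mean())
```

On a corpus this small the scores are meaningless; the point is only where the vectorizer gets fitted.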

# Dataset analysis

## Is fake news more frequent in a particular category?

In [None]:
df_fake['subject'].value_counts()

News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: subject, dtype: int64
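
The raw counts are easier to compare as shares of the fake collection. The sketch below recomputes them from the figures printed above. Note that it only describes how the fake articles are distributed: deciding whether a category is over-represented would also require the per-subject counts of the real articles, which this notebook does not show.

```python
# Counts copied from the value_counts() output above
fake_counts = {
    "News": 9050,
    "politics": 6841,
    "left-news": 4459,
    "Government News": 1570,
    "US_News": 783,
    "Middle-east": 778,
}

total = sum(fake_counts.values())
shares = {subject: n / total for subject, n in fake_counts.items()}

for subject, share in shares.items():
    print(f"{subject:<16}{share:6.1%}")
```

In pandas, `df_fake['subject'].value_counts(normalize=True)` should give the same proportions directly.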

## Within each category, are some topics more prone to fake news?

In [None]:
fake_text_cleaned=data_cleaner(df_fake['text']) # takes more than 15 minutes

In [None]:
with open('fake_text_cleaned.pkl', 'wb') as f:
    pickle.dump(fake_text_cleaned, f)

In [None]:
with open('fake_text_cleaned.pkl', 'rb') as f:
    fake_text_cleaned = pickle.load(f)

In [None]:
df_fake['text_cleaned']=fake_text_cleaned

In [None]:
news_subject_fake = df_fake[df_fake['subject'] == 'News']['text_cleaned']

In [None]:
import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora

def sent_to_words(items):
  for item in items:
    yield(simple_preprocess(item,deacc=True))

data_words=list(sent_to_words(news_subject_fake))
id2word=corpora.Dictionary(data_words)
corpus=[id2word.doc2bow(text) for text in data_words]
from pprint import pprint
num_topics=5
lda_model=gensim.models.LdaMulticore(corpus=corpus,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=10
                                     )
for topic_num in range(num_topics):
    pprint(lda_model.print_topic(topic_num))
fake_lda=lda_model[corpus]

('0.008*"people" + 0.006*"federal" + 0.006*"image" + 0.005*"government" + '
 '0.005*"bundy" + 0.005*"guns" + 0.005*"state" + 0.004*"time" + 0.003*"group" '
 '+ 0.003*"police"')
('0.012*"president" + 0.012*"trump" + 0.010*"republican" + 0.010*"cruz" + '
 '0.010*"people" + 0.009*"obama" + 0.007*"republicans" + 0.007*"image" + '
 '0.006*"women" + 0.006*"party"')
('0.068*"trump" + 0.016*"donald" + 0.011*"clinton" + 0.010*"president" + '
 '0.009*"hillary" + 0.008*"campaign" + 0.006*"news" + 0.006*"image" + '
 '0.006*"election" + 0.006*"realdonaldtrump"')
('0.018*"trump" + 0.013*"people" + 0.007*"water" + 0.006*"percent" + '
 '0.006*"republicans" + 0.006*"money" + 0.005*"health" + 0.005*"image" + '
 '0.005*"america" + 0.005*"americans"')
('0.015*"people" + 0.014*"black" + 0.014*"white" + 0.013*"trump" + '
 '0.010*"police" + 0.008*"racist" + 0.007*"video" + 0.007*"image" + '
 '0.004*"january" + 0.004*"donald"')
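
The strings returned by `print_topic` are readable but awkward to compare across subjects. As a small sketch, they can be parsed back into `(word, weight)` pairs with a regular expression; `parse_topic` is a hypothetical helper and the example string is an excerpt of the third topic printed above.

```python
import re

# Matches fragments such as 0.068*"trump"
TOPIC_TERM = re.compile(r'(\d*\.\d+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a print_topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TOPIC_TERM.findall(topic_string)]

topic = '0.068*"trump" + 0.016*"donald" + 0.011*"clinton" + 0.010*"president"'
terms = parse_topic(topic)
print(terms)
```

gensim's `lda_model.show_topic(topic_num)` should return such pairs without any parsing, which is probably the simpler route inside the notebook.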


In [None]:
politics_subject_fake = df_fake[df_fake['subject'] == 'politics']['text_cleaned']

In [None]:
data_words=list(sent_to_words(politics_subject_fake))
id2word=corpora.Dictionary(data_words)
corpus=[id2word.doc2bow(text) for text in data_words]
from pprint import pprint
num_topics=5
lda_model=gensim.models.LdaMulticore(corpus=corpus,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=10
                                     )
for topic_num in range(num_topics):
    pprint(lda_model.print_topic(topic_num))
fake_lda=lda_model[corpus]

('0.008*"police" + 0.008*"city" + 0.006*"court" + 0.006*"federal" + '
 '0.006*"state" + 0.005*"county" + 0.005*"mayor" + 0.005*"people" + '
 '0.004*"department" + 0.004*"judge"')
('0.031*"trump" + 0.012*"president" + 0.008*"donald" + 0.008*"people" + '
 '0.007*"republican" + 0.006*"obama" + 0.005*"news" + 0.005*"party" + '
 '0.005*"house" + 0.005*"immigration"')
('0.015*"obama" + 0.006*"united" + 0.005*"states" + 0.005*"president" + '
 '0.005*"american" + 0.005*"government" + 0.005*"state" + 0.004*"muslim" + '
 '0.004*"iran" + 0.004*"america"')
('0.035*"clinton" + 0.022*"hillary" + 0.009*"state" + 0.007*"campaign" + '
 '0.006*"department" + 0.006*"email" + 0.005*"bill" + 0.005*"former" + '
 '0.005*"emails" + 0.005*"news"')
('0.010*"people" + 0.010*"black" + 0.007*"white" + 0.006*"president" + '
 '0.006*"police" + 0.006*"trump" + 0.005*"obama" + 0.005*"video" + '
 '0.004*"america" + 0.004*"time"')


In [None]:
leftnews_subject_fake = df_fake[df_fake['subject'] == 'left-news']['text_cleaned']

In [None]:
data_words=list(sent_to_words(leftnews_subject_fake))
id2word=corpora.Dictionary(data_words)
corpus=[id2word.doc2bow(text) for text in data_words]
from pprint import pprint
num_topics=5
lda_model=gensim.models.LdaMulticore(corpus=corpus,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=10
                                     )
for topic_num in range(num_topics):
    pprint(lda_model.print_topic(topic_num))
fake_lda=lda_model[corpus]

('0.006*"trump" + 0.005*"obama" + 0.004*"president" + 0.004*"government" + '
 '0.004*"people" + 0.004*"state" + 0.003*"city" + 0.003*"news" + 0.003*"media" '
 '+ 0.003*"american"')
('0.007*"president" + 0.006*"first" + 0.006*"parenthood" + 0.004*"news" + '
 '0.004*"source" + 0.004*"iran" + 0.004*"abortion" + 0.004*"people" + '
 '0.003*"school" + 0.003*"obama"')
('0.012*"black" + 0.012*"trump" + 0.012*"hillary" + 0.012*"clinton" + '
 '0.009*"police" + 0.008*"people" + 0.008*"president" + 0.007*"white" + '
 '0.005*"obama" + 0.005*"video"')
('0.008*"police" + 0.006*"court" + 0.006*"state" + 0.006*"clinton" + '
 '0.005*"people" + 0.005*"department" + 0.004*"news" + 0.004*"federal" + '
 '0.003*"hillary" + 0.003*"investigation"')
('0.008*"people" + 0.007*"obama" + 0.006*"trump" + 0.005*"students" + '
 '0.005*"school" + 0.005*"president" + 0.004*"children" + 0.004*"news" + '
 '0.004*"american" + 0.003*"white"')


In [None]:
governmentnews_subject_fake = df_fake[df_fake['subject'] == 'Government News']['text_cleaned']

In [None]:
data_words=list(sent_to_words(governmentnews_subject_fake))
id2word=corpora.Dictionary(data_words)
corpus=[id2word.doc2bow(text) for text in data_words]
from pprint import pprint
num_topics=5
lda_model=gensim.models.LdaMulticore(corpus=corpus,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=10
                                     )
for topic_num in range(num_topics):
    pprint(lda_model.print_topic(topic_num))
fake_lda=lda_model[corpus]

('0.008*"obama" + 0.007*"people" + 0.006*"president" + 0.006*"state" + '
 '0.005*"government" + 0.005*"trump" + 0.004*"immigration" + 0.004*"illegal" + '
 '0.004*"court" + 0.004*"bill"')
('0.006*"food" + 0.006*"obama" + 0.005*"president" + 0.005*"government" + '
 '0.005*"house" + 0.004*"program" + 0.004*"people" + 0.004*"trump" + '
 '0.004*"american" + 0.003*"million"')
('0.008*"obama" + 0.006*"refugees" + 0.005*"people" + 0.005*"state" + '
 '0.005*"president" + 0.005*"government" + 0.005*"states" + 0.004*"united" + '
 '0.004*"federal" + 0.004*"police"')
('0.012*"clinton" + 0.008*"obama" + 0.007*"president" + 0.007*"state" + '
 '0.006*"hillary" + 0.006*"department" + 0.005*"people" + 0.005*"house" + '
 '0.004*"court" + 0.004*"government"')
('0.008*"iran" + 0.007*"united" + 0.007*"obama" + 0.006*"president" + '
 '0.006*"nuclear" + 0.006*"people" + 0.006*"trump" + 0.006*"america" + '
 '0.005*"states" + 0.005*"american"')


In [None]:
usnews_subject_fake = df_fake[df_fake['subject'] == 'US_News']['text_cleaned']

In [None]:
data_words=list(sent_to_words(usnews_subject_fake))
id2word=corpora.Dictionary(data_words)
corpus=[id2word.doc2bow(text) for text in data_words]
from pprint import pprint
num_topics=5
lda_model=gensim.models.LdaMulticore(corpus=corpus,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=10
                                     )
for topic_num in range(num_topics):
    pprint(lda_model.print_topic(topic_num))
fake_lda=lda_model[corpus]

('0.008*"media" + 0.006*"story" + 0.005*"wire" + 0.004*"news" + '
 '0.004*"government" + 0.004*"federal" + 0.004*"public" + 0.004*"shooter" + '
 '0.004*"security" + 0.004*"mass"')
('0.014*"syria" + 0.008*"media" + 0.007*"news" + 0.006*"syrian" + 0.006*"wire" '
 '+ 0.006*"government" + 0.006*"state" + 0.005*"washington" + 0.005*"military" '
 '+ 0.005*"century"')
('0.010*"wire" + 0.008*"room" + 0.008*"boiler" + 0.007*"news" + 0.007*"radio" '
 '+ 0.006*"political" + 0.006*"media" + 0.005*"russian" + 0.005*"episode" + '
 '0.005*"broadcast"')
('0.006*"media" + 0.006*"clinton" + 0.006*"trump" + 0.005*"news" + '
 '0.005*"president" + 0.005*"wire" + 0.004*"state" + 0.004*"century" + '
 '0.004*"order" + 0.004*"washington"')
('0.020*"trump" + 0.011*"clinton" + 0.007*"wire" + 0.006*"president" + '
 '0.006*"election" + 0.006*"century" + 0.006*"political" + 0.005*"hillary" + '
 '0.005*"media" + 0.005*"russia"')


In [None]:
middleeast_subject_fake = df_fake[df_fake['subject'] == 'Middle-east']['text_cleaned']

In [None]:
data_words=list(sent_to_words(middleeast_subject_fake))
id2word=corpora.Dictionary(data_words)
corpus=[id2word.doc2bow(text) for text in data_words]
from pprint import pprint
num_topics=5
lda_model=gensim.models.LdaMulticore(corpus=corpus,
                                     id2word=id2word,
                                     num_topics=num_topics,
                                     passes=10
                                     )
for topic_num in range(num_topics):
    pprint(lda_model.print_topic(topic_num))
fake_lda=lda_model[corpus]

('0.009*"clinton" + 0.008*"wire" + 0.005*"century" + 0.005*"state" + '
 '0.004*"world" + 0.004*"president" + 0.004*"news" + 0.004*"people" + '
 '0.004*"trump" + 0.004*"hillary"')
('0.006*"news" + 0.005*"people" + 0.004*"world" + 0.004*"wire" + '
 '0.004*"century" + 0.004*"trump" + 0.004*"political" + 0.003*"media" + '
 '0.003*"israel" + 0.003*"american"')
('0.010*"media" + 0.009*"syria" + 0.008*"news" + 0.006*"wire" + 0.005*"story" '
 '+ 0.005*"government" + 0.004*"century" + 0.004*"washington" + 0.004*"syrian" '
 '+ 0.004*"state"')
('0.010*"room" + 0.009*"boiler" + 0.007*"radio" + 0.006*"wire" + '
 '0.005*"broadcast" + 0.005*"political" + 0.005*"another" + 0.005*"media" + '
 '0.004*"current" + 0.004*"episode"')
('0.023*"trump" + 0.009*"clinton" + 0.007*"wire" + 0.007*"election" + '
 '0.007*"media" + 0.007*"russia" + 0.007*"president" + 0.006*"political" + '
 '0.006*"century" + 0.005*"news"')


In [None]:
from collections import Counter

# For each subject, take the 10 most frequent words of the fake articles and keep
# the 5 that the classifier, given the word alone, scores as most likely fake
top_negative_words_by_subject = {}
for subject in df_fake['subject'].unique():
    subset = df_fake[df_fake['subject'] == subject]
    combined_text = ' '.join(subset['text_cleaned'])
    words = combined_text.split()
    word_counts = Counter(words)
    top_words = [word for word, _ in word_counts.most_common(10)]
    top_negative_words = []
    for word in top_words:
        # Vectorize the single word with the vectorizer fitted on the training set
        probabilities = clf.predict_proba(bow_tfidf([word],vectorized)[0])
        negative_index = list(clf.classes_).index('fake')
        negative_percentage = probabilities[0][negative_index] *100
        top_negative_words.append((word, negative_percentage))
    top_negative_words.sort(key=lambda x: x[1], reverse=True)
    top_negative_words = top_negative_words[:5]
    top_negative_words_by_subject[subject] = top_negative_words

# Output labels: "Parola" = word, "Percentuale fake" = fake percentage
for subject, words in top_negative_words_by_subject.items():
    print(f"Subject: {subject}")
    for word, percentage in words:
        print(f"Parola: {word}, Percentuale fake: {percentage:.2f}%")
    print("\n")

Subject: News
Parola: image, Percentuale fake: 99.32%
Parola: obama, Percentuale fake: 98.45%
Parola: time, Percentuale fake: 98.41%
Parola: people, Percentuale fake: 94.09%
Parola: clinton, Percentuale fake: 89.51%


Subject: politics
Parola: hillary, Percentuale fake: 98.84%
Parola: obama, Percentuale fake: 98.45%
Parola: time, Percentuale fake: 98.41%
Parola: news, Percentuale fake: 96.56%
Parola: people, Percentuale fake: 94.09%


Subject: Government News
Parola: obama, Percentuale fake: 98.45%
Parola: people, Percentuale fake: 94.09%
Parola: clinton, Percentuale fake: 89.51%
Parola: trump, Percentuale fake: 84.93%
Parola: department, Percentuale fake: 77.85%


Subject: left-news
Parola: hillary, Percentuale fake: 98.84%
Parola: obama, Percentuale fake: 98.45%
Parola: black, Percentuale fake: 97.96%
Parola: news, Percentuale fake: 96.56%
Parola: people, Percentuale fake: 94.09%


Subject: US_News
Parola: wire, Percentuale fake: 99.18%
Parola: century, Percentuale fake: 99.11%
Parol

## Do the titles of fake news show patterns?

In [None]:
fake_title_cleaned=data_cleaner(df_fake['title']) # takes more than 15 minutes

In [None]:
with open('fake_title_cleaned.pkl', 'wb') as f:
    pickle.dump(fake_title_cleaned, f)

In [None]:
with open('fake_title_cleaned.pkl', 'rb') as f:
    fake_title_cleaned = pickle.load(f)

In [None]:
import gensim
from gensim.utils import simple_preprocess

def sent_to_words(items):
  for item in items:
    yield(simple_preprocess(item,deacc=True))

title_words = Counter(word for sublist in sent_to_words(fake_title_cleaned) for word in sublist)
top_words = title_words.most_common(10)
top_words

[('video', 8297),
 ('trump', 7861),
 ('obama', 2540),
 ('hillary', 2269),
 ('clinton', 1118),
 ('president', 1116),
 ('black', 877),
 ('news', 873),
 ('white', 854),
 ('donald', 784)]
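
Because `data_cleaner` lowercases the titles and strips punctuation, the word counts above cannot reveal surface patterns such as all-caps words, exclamation marks or bracketed tags. A minimal sketch of how such features could be measured on the raw titles follows. The two example titles are invented and `title_features` is a hypothetical helper; in the notebook it would be applied to `df_fake['title']` and `df_true['title']`.

```python
import string

def title_features(title):
    """Compute a few surface features of a raw (uncleaned) title."""
    words = title.split()
    stripped = [w.strip(string.punctuation) for w in words]
    # Words of at least two characters written entirely in capitals
    caps_words = [w for w in stripped if len(w) > 1 and w.isupper()]
    return {
        "n_words": len(words),
        "caps_ratio": len(caps_words) / len(words) if words else 0.0,
        "exclamations": title.count("!"),
        "has_bracket_tag": "[" in title or "(" in title,
    }

# Invented titles, for illustration only
titles = [
    "BREAKING: Senator CAUGHT On Tape [VIDEO]",
    "Senate passes budget bill after long debate",
]
features = [title_features(t) for t in titles]
for title, feats in zip(titles, features):
    print(title, feats)
```

Comparing the averages of these features between the two collections would show whether the high count of "video" in fake titles goes together with a distinctive formatting style.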

In [None]:
title_vectorized,vectorized=bow_tfidf(fake_title_cleaned, None)
title_vectorized

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
import numpy as np

# Rank the terms of the first title only (row 0) by TF-IDF weight
top_indices = np.argsort(title_vectorized[0])[::-1][:10]
top_words_with_tfidf = [(vectorized.get_feature_names_out()[index], title_vectorized[0, index]) for index in top_indices]
top_words_with_tfidf

[('message', 0.6114178544299013),
 ('year', 0.6042506035112247),
 ('donald', 0.45960558180446015),
 ('trump', 0.22318630024004774),
 ('žižek', 0.0),
 ('festivals', 0.0),
 ('fggots', 0.0),
 ('fggot', 0.0),
 ('fewer', 0.0),
 ('fever', 0.0)]