# Fake News Filter

> **Note**: Some of the steps in this code (mostly those related to dataset cleaning) will take a significant amount of time.



## Setup

We'll start by downloading the datasets and importing all the tools we'll need to create the filter:

In [1]:
!wget https://proai-datasets.s3.eu-west-3.amazonaws.com/fake_news.zip
!unzip fake_news.zip

--2024-06-23 12:14:29--  https://proai-datasets.s3.eu-west-3.amazonaws.com/fake_news.zip
Resolving proai-datasets.s3.eu-west-3.amazonaws.com (proai-datasets.s3.eu-west-3.amazonaws.com)... 3.5.224.12, 3.5.225.18
Connecting to proai-datasets.s3.eu-west-3.amazonaws.com (proai-datasets.s3.eu-west-3.amazonaws.com)|3.5.224.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42975911 (41M) [application/zip]
Saving to: ‘fake_news.zip’


2024-06-23 12:14:38 (5.28 MB/s) - ‘fake_news.zip’ saved [42975911/42975911]

Archive:  fake_news.zip
  inflating: Fake.csv                
  inflating: True.csv                


In [2]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import spacy
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
from pprint import pprint
from collections import Counter
import numpy as np

In [3]:
df_true = pd.read_csv("True.csv")
df_true.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB


In [4]:
df_fake = pd.read_csv("Fake.csv")
df_fake.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB


## Preliminary steps

Let's merge the two datasets:

In [5]:
df_true['source'] = 'true'
df_fake['source'] = 'fake'
df_true_subset = df_true[['title', 'text', 'source']]
df_fake_subset = df_fake[['title', 'text', 'source']]
df_news = pd.concat([df_true_subset, df_fake_subset], ignore_index=True)
df_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   44898 non-null  object
 1   text    44898 non-null  object
 2   source  44898 non-null  object
dtypes: object(3)
memory usage: 1.0+ MB


Let's create the function for cleaning and preprocessing the dataset:

In [6]:
nltk.download('stopwords')
english_stopwords = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_sm')
punctuation=set(string.punctuation)

def data_cleaner(dataset):

    def filter_tokens(text):
        doc = nlp(text)
        filtered_tokens = [token.text for token in doc if token.pos_ not in ['PRON', 'VERB', 'ADV', 'AUX', 'ADP']]
        return ' '.join(filtered_tokens)

    dataset_to_return = []
    for sentence in dataset:
        sentence = sentence.lower()
        # Strip punctuation using the precomputed set
        sentence = ''.join(char for char in sentence if char not in punctuation)
        sentence = ' '.join(word for word in sentence.split() if word not in english_stopwords)
        # Raw string avoids an invalid-escape warning for \d
        sentence = re.sub(r'\d', '', sentence)
        # Drop words of three characters or fewer
        sentence = ' '.join(word for word in sentence.split() if len(word) > 3)
        sentence = filter_tokens(sentence)
        dataset_to_return.append(sentence)

    return dataset_to_return

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
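
To make the order of the cleaning steps concrete, here is a minimal sketch of the string-level part of `data_cleaner` applied to one sentence. It leaves out the spaCy part-of-speech filter, and `SAMPLE_STOPWORDS` and `basic_clean` are hypothetical names with a tiny stand-in stopword list (the notebook uses NLTK's full English list).

```python
import re
import string

# Hypothetical mini stopword list; the notebook uses NLTK's English stopwords.
SAMPLE_STOPWORDS = {"the", "on", "a", "of"}

def basic_clean(sentence, stop_words=SAMPLE_STOPWORDS):
    """Apply the same string-level steps as data_cleaner, in the same order."""
    sentence = sentence.lower()
    # Remove punctuation characters
    sentence = ''.join(ch for ch in sentence if ch not in string.punctuation)
    # Remove stopwords
    sentence = ' '.join(w for w in sentence.split() if w not in stop_words)
    # Remove digits
    sentence = re.sub(r'\d', '', sentence)
    # Keep only words longer than three characters
    return ' '.join(w for w in sentence.split() if len(w) > 3)

cleaned_example = basic_clean("The FBI said 3 things on Monday, 2017!")
print(cleaned_example)  # said things monday
```

Note that the length filter also removes short acronyms such as "FBI", which may or may not be desirable for this task.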


Let's create the function for vectorizing the text using the TF-IDF method:

In [7]:
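vectorizer = TfidfVectorizer()

def bow_tfidf(dataset, vectorizer):
    # Fit a new vectorizer when none is given; otherwise reuse the fitted one
    if vectorizer is None:
        vectorizer = TfidfVectorizer()
        X = vectorizer.fit_transform(dataset)
    else:
        X = vectorizer.transform(dataset)
    # Dense array: easy to inspect, but memory-hungry for large vocabularies
    return X.toarray(), vectorizer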
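
As a rough illustration of what the vectorizer computes, the sketch below reimplements TF-IDF in plain Python under scikit-learn's default settings: smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The function name `tfidf_rows` and the sample documents are made up for the example, and whitespace tokenization is a simplification of the real tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        raw = [term_freq[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

sample_vocab, sample_rows = tfidf_rows(["trump video", "trump obama", "trump clinton"])
for row in sample_rows:
    print([round(v, 3) for v in row])
```

In the first document, "video" receives a higher weight than "trump" because "trump" appears in every document and therefore carries less information.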

Let's clean the "text" attribute of the dataframe:

In [8]:
news_text_cleaned=data_cleaner(df_news['text'])

In [9]:
with open('news_text_cleaned.pkl', 'wb') as f:
    pickle.dump(news_text_cleaned, f)

In [10]:
with open('news_text_cleaned.pkl', 'rb') as f:
    news_text_cleaned = pickle.load(f)
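
The dump/load pair above works as a cache for the slow cleaning step. One possible way to wrap that pattern, so the expensive call only runs when no cached file exists, is sketched below; `load_or_compute` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import os
import pickle

def load_or_compute(path, compute):
    """Load a pickled result from `path`, or compute and cache it if missing."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result
```

With this helper the cleaning cell could become `news_text_cleaned = load_or_compute('news_text_cleaned.pkl', lambda: data_cleaner(df_news['text']))`.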

In [11]:
df_news['text_cleaned']=news_text_cleaned

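Now we create the train and test subsets for our model. Note that the 30% of rows placed in `df_2` is not used afterwards; the model is trained and tested only on the two parts of `df_1`: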

In [12]:
df_1,df_2=train_test_split(df_news,test_size=0.30,random_state=69)
df_train,df_test=train_test_split(df_1,test_size=0.25,random_state=69)
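
The arithmetic behind the two consecutive splits can be checked with a few lines of plain Python. This is a sketch that assumes the test fraction is rounded up to a whole number of rows, an assumption that agrees with the class counts shown for `df_train` (12396 + 11175 = 23571); `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a split with the given test fraction."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

n_total = 44898                                  # rows in df_news
n_df1, n_df2 = split_sizes(n_total, 0.30)        # df_1, df_2
n_train, n_test = split_sizes(n_df1, 0.25)       # df_train, df_test

print(n_df1, n_df2)      # 31428 13470
print(n_train, n_test)   # 23571 7857
print(round(n_train / n_total, 3), round(n_test / n_total, 3))  # 0.525 0.175
```

So roughly 52.5% of the articles are used for training and 17.5% for testing.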

In [13]:
train_news_cleaned,vectorized=bow_tfidf(df_train['text_cleaned'], None)
train_news_cleaned

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [14]:
len(df_train[df_train['source']=='fake'])

12396

In [15]:
len(df_train[df_train['source']=='true'])

11175

For our model we will use the logistic function as the activation function, a single hidden layer of 5 units, and a tolerance of 0.01:

In [16]:
clf= MLPClassifier(activation='logistic',
                   solver='adam',
                   max_iter=50,
                   hidden_layer_sizes=(5,),
                   tol=0.01,
                   verbose=True
                   )

In [17]:
clf.fit(train_news_cleaned,df_train['source'])

Iteration 1, loss = 0.69882986
Iteration 2, loss = 0.66039019
Iteration 3, loss = 0.61882276
Iteration 4, loss = 0.55239826
Iteration 5, loss = 0.47337593
Iteration 6, loss = 0.40080422
Iteration 7, loss = 0.34107603
Iteration 8, loss = 0.29326529
Iteration 9, loss = 0.25490825
Iteration 10, loss = 0.22376220
Iteration 11, loss = 0.19807221
Iteration 12, loss = 0.17658218
Iteration 13, loss = 0.15840729
Iteration 14, loss = 0.14287395
Iteration 15, loss = 0.12947865
Iteration 16, loss = 0.11784900
Iteration 17, loss = 0.10766520
Iteration 18, loss = 0.09873065
Iteration 19, loss = 0.09081305
Iteration 20, loss = 0.08378658
Iteration 21, loss = 0.07751796
Iteration 22, loss = 0.07189704
Iteration 23, loss = 0.06685980
Iteration 24, loss = 0.06230687
Iteration 25, loss = 0.05818294
Iteration 26, loss = 0.05445468
Iteration 27, loss = 0.05105625
Iteration 28, loss = 0.04796407
Training loss did not improve more than tol=0.010000 for 10 consecutive epochs. Stopping.
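
Training stops at iteration 28 rather than at `max_iter=50` because of the tolerance. The sketch below replays the logged losses through a simplified version of the stopping rule: an epoch counts as "no improvement" when the loss is not at least `tol` below the best loss so far, and training stops once that has happened more than 10 times in a row (the default patience, as far as the message above indicates). `stopping_epoch` is a hypothetical helper written for this illustration.

```python
def stopping_epoch(losses, tol=0.01, patience=10):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    stalled = 0
    for epoch, loss in enumerate(losses, start=1):
        # An epoch only counts as progress if it beats the best loss by tol
        if loss > best - tol:
            stalled += 1
        else:
            stalled = 0
        best = min(best, loss)
        if stalled > patience:
            return epoch
    return None

# Losses copied from the training log above
logged_losses = [
    0.69882986, 0.66039019, 0.61882276, 0.55239826, 0.47337593, 0.40080422,
    0.34107603, 0.29326529, 0.25490825, 0.22376220, 0.19807221, 0.17658218,
    0.15840729, 0.14287395, 0.12947865, 0.11784900, 0.10766520, 0.09873065,
    0.09081305, 0.08378658, 0.07751796, 0.07189704, 0.06685980, 0.06230687,
    0.05818294, 0.05445468, 0.05105625, 0.04796407,
]
print(stopping_epoch(logged_losses))  # 28
```

The last improvement of at least 0.01 happens at iteration 17, so the loss is still decreasing when training stops; a smaller `tol` would let it run longer.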


In [18]:
with open('fake_news_filter.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [19]:
with open('fake_news_filter.pkl', 'rb') as f:
    clf = pickle.load(f)

In [20]:
test_news_cleaned,vectorized=bow_tfidf(df_test['text_cleaned'], vectorized)
test_news_cleaned

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [21]:
clf.score(test_news_cleaned,df_test['source'])

0.9758177421407662

Now let's evaluate the model with 3-fold cross-validation:

In [22]:
X_vectorized = vectorizer.fit_transform(df_news['text_cleaned'])
cross_val_scores = cross_val_score(clf, X_vectorized, df_news['source'], cv=3, scoring='accuracy')

Iteration 1, loss = 0.72703356
Iteration 2, loss = 0.64231695
Iteration 3, loss = 0.57615442
Iteration 4, loss = 0.49963849
Iteration 5, loss = 0.42086491
Iteration 6, loss = 0.35098994
Iteration 7, loss = 0.29387485
Iteration 8, loss = 0.24861299
Iteration 9, loss = 0.21281567
Iteration 10, loss = 0.18424904
Iteration 11, loss = 0.16111796
Iteration 12, loss = 0.14215928
Iteration 13, loss = 0.12645354
Iteration 14, loss = 0.11324165
Iteration 15, loss = 0.10199234
Iteration 16, loss = 0.09233744
Iteration 17, loss = 0.08399503
Iteration 18, loss = 0.07669869
Iteration 19, loss = 0.07029189
Iteration 20, loss = 0.06462423
Iteration 21, loss = 0.05959983
Iteration 22, loss = 0.05512872
Iteration 23, loss = 0.05112702
Iteration 24, loss = 0.04754076
Iteration 25, loss = 0.04430702
Iteration 26, loss = 0.04138709
Training loss did not improve more than tol=0.010000 for 10 consecutive epochs. Stopping.
Iteration 1, loss = 0.67644343
Iteration 2, loss = 0.61410845
Iteration 3, loss = 0.513

In [23]:
with open('cross_val_scores.pkl', 'wb') as f:
    pickle.dump(cross_val_scores, f)

In [24]:
with open('cross_val_scores.pkl', 'rb') as f:
    cross_val_scores = pickle.load(f)

In [25]:
cross_val_scores

array([0.93338233, 0.92302552, 0.96291594])

In [26]:
cross_val_scores.mean()

0.9397746002049089
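
In the cross-validation above, the TF-IDF vocabulary and IDF weights are fitted on the whole dataset before the folds are created, so each validation fold influences the features used for training. A possible alternative, sketched here on a tiny made-up corpus, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that the vectorizer is refitted inside every fold. The texts and labels are invented for the example and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Made-up toy corpus, only to keep the example self-contained
toy_texts = [
    "senate passes budget bill", "court rules on trade case",
    "minister meets foreign delegation", "central bank holds rates",
    "parliament votes on reform", "agency publishes jobs report",
    "shocking video exposes secret plot", "you will not believe this image",
    "watch celebrity destroys politician", "secret plot exposed in video",
    "unbelievable image goes viral", "politician destroyed in shocking video",
]
toy_labels = ["true"] * 6 + ["fake"] * 6

pipeline = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(activation='logistic', solver='adam', max_iter=50,
                  hidden_layer_sizes=(5,), tol=0.01, random_state=0),
)

# The vectorizer is fitted on the training part of each fold only
scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=3, scoring='accuracy')
print(scores, scores.mean())
```

On the real data the same idea would be `cross_val_score(pipeline, df_news['text_cleaned'], df_news['source'], cv=3)`; the scores on such a small toy corpus are not meaningful in themselves.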

## Fake Dataset Analysis

Which categories are most impacted by fake news?

In [27]:
df_fake['subject'].value_counts()

subject
News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: count, dtype: int64

What are the most affected topics within each category?

In [28]:
fake_text_cleaned=data_cleaner(df_fake['text'])

In [29]:
with open('fake_text_cleaned.pkl', 'wb') as f:
    pickle.dump(fake_text_cleaned, f)

In [30]:
with open('fake_text_cleaned.pkl', 'rb') as f:
    fake_text_cleaned = pickle.load(f)

In [31]:
df_fake['text_cleaned']=fake_text_cleaned

In [32]:
def sent_to_words(items):
    for item in items:
        # simple_preprocess converts a document into a list of lowercase tokens; deacc=True strips accents
        yield simple_preprocess(item, deacc=True)

Let's create a function that builds an LDA model to extract the main topics and assigns them to the documents in the corpus.

In [33]:
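def topic_modeling(subject, num_topics=5, passes=10):
    # `subject` is the collection of cleaned documents belonging to one subject

    # Tokenize each document
    data_words = list(sent_to_words(subject))

    # Map each unique token to an integer id
    id2word = corpora.Dictionary(data_words)

    # Bag-of-words representation: (token_id, count) pairs for each document
    corpus = [id2word.doc2bow(text) for text in data_words]

    lda_model = gensim.models.LdaMulticore(
        corpus=corpus,
        id2word=id2word,
        num_topics=num_topics,
        passes=passes
    )

    # Print the most probable words of each topic
    for topic_num in range(num_topics):
        pprint(lda_model.print_topic(topic_num))

    # Topic distribution of each document in the corpus
    fake_lda = lda_model[corpus]

    return lda_model, fake_lda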
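
The second value returned by `topic_modeling` yields, for each document, a list of `(topic_id, probability)` pairs. A minimal sketch of how those pairs could be turned into one dominant topic per document follows; `dominant_topic` is a hypothetical helper and the distributions are made-up numbers.

```python
from collections import Counter

def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up topic distributions for three documents
example_distributions = [
    [(0, 0.07), (2, 0.81), (4, 0.12)],
    [(1, 0.55), (3, 0.45)],
    [(2, 0.60), (3, 0.40)],
]

# Count how many documents have each topic as their dominant one
topic_counts = Counter(dominant_topic(d)[0] for d in example_distributions)
print(topic_counts)
```

The same `Counter` expression could be applied to the `fake_lda` object returned for a subject to see how its documents are spread across the five topics.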


In [34]:
news_fake = df_fake[df_fake['subject'] == 'News']['text_cleaned']
politics_fake = df_fake[df_fake['subject'] == 'politics']['text_cleaned']
left_fake = df_fake[df_fake['subject'] == 'left-news']['text_cleaned']
government_fake = df_fake[df_fake['subject'] == 'Government News']['text_cleaned']
us_fake = df_fake[df_fake['subject'] == 'US_News']['text_cleaned']
middleeast_fake = df_fake[df_fake['subject'] == 'Middle-east']['text_cleaned']

In [35]:
topic_modeling(news_fake)

('0.032*"trump" + 0.010*"people" + 0.009*"donald" + 0.008*"clinton" + '
 '0.007*"campaign" + 0.006*"republican" + 0.006*"image" + 0.006*"women" + '
 '0.005*"hillary" + 0.005*"party"')
('0.058*"trump" + 0.014*"president" + 0.012*"donald" + 0.007*"image" + '
 '0.006*"people" + 0.006*"obama" + 0.006*"house" + 0.005*"white" + '
 '0.005*"time" + 0.005*"russia"')
('0.015*"people" + 0.007*"president" + 0.007*"republicans" + 0.006*"image" + '
 '0.005*"obama" + 0.005*"realdonaldtrump" + 0.005*"trump" + 0.005*"court" + '
 '0.004*"january" + 0.004*"time"')
('0.009*"police" + 0.007*"people" + 0.006*"state" + 0.006*"image" + '
 '0.005*"black" + 0.005*"clinton" + 0.004*"time" + 0.004*"women" + '
 '0.004*"hillary" + 0.004*"trump"')
('0.029*"trump" + 0.010*"people" + 0.009*"donald" + 0.007*"president" + '
 '0.007*"image" + 0.006*"white" + 0.006*"obama" + 0.005*"cruz" + 0.004*"time" '
 '+ 0.004*"video"')


(<gensim.models.ldamulticore.LdaMulticore at 0x7d1338007c40>,
 <gensim.interfaces.TransformedCorpus at 0x7d18b3eb2ad0>)

In [36]:
topic_modeling(politics_fake)

('0.009*"trump" + 0.008*"president" + 0.007*"police" + 0.005*"people" + '
 '0.003*"state" + 0.003*"county" + 0.003*"time" + 0.003*"party" + '
 '0.003*"american" + 0.003*"north"')
('0.011*"hillary" + 0.008*"trump" + 0.007*"clinton" + 0.007*"people" + '
 '0.007*"women" + 0.006*"president" + 0.004*"video" + 0.004*"sexual" + '
 '0.004*"first" + 0.004*"time"')
('0.008*"people" + 0.007*"black" + 0.006*"police" + 0.005*"white" + '
 '0.005*"obama" + 0.004*"students" + 0.004*"muslim" + 0.004*"america" + '
 '0.004*"school" + 0.004*"state"')
('0.018*"clinton" + 0.011*"obama" + 0.009*"president" + 0.008*"hillary" + '
 '0.008*"trump" + 0.007*"state" + 0.005*"department" + 0.005*"campaign" + '
 '0.004*"former" + 0.004*"government"')
('0.029*"trump" + 0.011*"president" + 0.007*"donald" + 0.007*"people" + '
 '0.006*"news" + 0.005*"house" + 0.005*"republican" + 0.005*"hillary" + '
 '0.004*"media" + 0.004*"obama"')


(<gensim.models.ldamulticore.LdaMulticore at 0x7d18b3eb31f0>,
 <gensim.interfaces.TransformedCorpus at 0x7d1337afb700>)

In [37]:
topic_modeling(left_fake)

('0.014*"clinton" + 0.009*"hillary" + 0.008*"trump" + 0.006*"president" + '
 '0.005*"state" + 0.005*"news" + 0.005*"people" + 0.005*"obama" + '
 '0.005*"house" + 0.004*"bill"')
('0.011*"police" + 0.006*"people" + 0.005*"city" + 0.004*"students" + '
 '0.004*"school" + 0.004*"state" + 0.004*"black" + 0.003*"white" + '
 '0.003*"officers" + 0.003*"time"')
('0.008*"trump" + 0.007*"president" + 0.006*"hillary" + 0.006*"obama" + '
 '0.005*"clinton" + 0.005*"people" + 0.004*"news" + 0.004*"america" + '
 '0.004*"american" + 0.004*"states"')
('0.008*"people" + 0.004*"video" + 0.004*"news" + 0.004*"women" + '
 '0.004*"white" + 0.003*"police" + 0.003*"time" + 0.003*"trump" + '
 '0.003*"children" + 0.003*"school"')
('0.022*"trump" + 0.013*"president" + 0.010*"obama" + 0.009*"black" + '
 '0.008*"people" + 0.006*"white" + 0.005*"donald" + 0.004*"america" + '
 '0.004*"police" + 0.004*"media"')


(<gensim.models.ldamulticore.LdaMulticore at 0x7d1338004df0>,
 <gensim.interfaces.TransformedCorpus at 0x7d1337af8ca0>)

In [38]:
topic_modeling(government_fake)

('0.016*"obama" + 0.007*"president" + 0.006*"court" + 0.005*"people" + '
 '0.005*"iran" + 0.004*"hillary" + 0.004*"clinton" + 0.004*"house" + '
 '0.003*"time" + 0.003*"white"')
('0.009*"president" + 0.006*"house" + 0.006*"obama" + 0.005*"government" + '
 '0.004*"people" + 0.004*"federal" + 0.004*"american" + 0.004*"bill" + '
 '0.004*"trump" + 0.004*"first"')
('0.009*"people" + 0.005*"obama" + 0.004*"government" + 0.004*"president" + '
 '0.004*"million" + 0.004*"american" + 0.004*"state" + 0.004*"country" + '
 '0.003*"federal" + 0.003*"program"')
('0.014*"clinton" + 0.007*"department" + 0.007*"state" + 0.005*"hillary" + '
 '0.005*"people" + 0.005*"investigation" + 0.005*"email" + 0.004*"government" '
 '+ 0.004*"emails" + 0.004*"federal"')
('0.007*"state" + 0.007*"obama" + 0.006*"trump" + 0.006*"president" + '
 '0.006*"united" + 0.005*"states" + 0.005*"refugees" + 0.005*"government" + '
 '0.004*"people" + 0.004*"america"')


(<gensim.models.ldamulticore.LdaMulticore at 0x7d1338006170>,
 <gensim.interfaces.TransformedCorpus at 0x7d1337af8a00>)

In [39]:
topic_modeling(us_fake)

('0.007*"trump" + 0.004*"rubio" + 0.004*"wire" + 0.004*"century" + '
 '0.004*"federal" + 0.004*"media" + 0.003*"president" + 0.003*"finicum" + '
 '0.003*"government" + 0.003*"obama"')
('0.010*"wire" + 0.008*"syria" + 0.007*"room" + 0.007*"boiler" + 0.006*"radio" '
 '+ 0.006*"political" + 0.005*"media" + 0.005*"episode" + 0.004*"century" + '
 '0.004*"broadcast"')
('0.015*"trump" + 0.011*"clinton" + 0.010*"news" + 0.010*"media" + '
 '0.007*"wire" + 0.006*"russia" + 0.006*"election" + 0.006*"president" + '
 '0.006*"hillary" + 0.005*"century"')
('0.007*"media" + 0.006*"government" + 0.005*"state" + 0.005*"wire" + '
 '0.005*"people" + 0.005*"security" + 0.004*"news" + 0.004*"many" + '
 '0.004*"century" + 0.004*"shooter"')
('0.007*"story" + 0.007*"syria" + 0.005*"wire" + 0.004*"media" + '
 '0.004*"military" + 0.004*"syrian" + 0.004*"government" + 0.003*"news" + '
 '0.003*"century" + 0.003*"president"')


(<gensim.models.ldamulticore.LdaMulticore at 0x7d1337af9750>,
 <gensim.interfaces.TransformedCorpus at 0x7d1337af8520>)

In [40]:
topic_modeling(middleeast_fake)

('0.007*"media" + 0.006*"story" + 0.006*"security" + 0.005*"wire" + '
 '0.005*"government" + 0.005*"shooter" + 0.004*"attack" + 0.004*"intelligence" '
 '+ 0.004*"police" + 0.004*"shooting"')
('0.011*"room" + 0.011*"boiler" + 0.009*"radio" + 0.008*"wire" + '
 '0.008*"political" + 0.007*"episode" + 0.007*"broadcast" + 0.006*"live" + '
 '0.005*"current" + 0.005*"another"')
('0.015*"trump" + 0.011*"wire" + 0.007*"century" + 0.006*"news" + '
 '0.005*"clinton" + 0.005*"president" + 0.005*"media" + 0.005*"political" + '
 '0.004*"state" + 0.004*"donald"')
('0.010*"syria" + 0.009*"trump" + 0.007*"clinton" + 0.007*"news" + '
 '0.006*"president" + 0.006*"russia" + 0.006*"media" + 0.005*"washington" + '
 '0.005*"government" + 0.005*"wire"')
('0.011*"media" + 0.009*"trump" + 0.009*"news" + 0.007*"clinton" + '
 '0.006*"election" + 0.006*"wire" + 0.005*"century" + 0.005*"party" + '
 '0.004*"political" + 0.004*"hillary"')


(<gensim.models.ldamulticore.LdaMulticore at 0x7d1337afb1c0>,
 <gensim.interfaces.TransformedCorpus at 0x7d1337af83d0>)

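Now let's take the ten most frequent words of each subject and ask the classifier how likely each word, taken on its own, is to be labelled as fake:

In [41]: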
top_negative_words_by_subject = {}
for subject in df_fake['subject'].unique():
    subset = df_fake[df_fake['subject'] == subject]
    combined_text = ' '.join(subset['text_cleaned'])
    words = combined_text.split()
    word_counts = Counter(words)
    top_words = [word for word, _ in word_counts.most_common(10)]
    top_negative_words = []
    for word in top_words:
        probabilities = clf.predict_proba(bow_tfidf([word],vectorized)[0])
        negative_index = list(clf.classes_).index('fake')
        negative_percentage = probabilities[0][negative_index] *100
        top_negative_words.append((word, negative_percentage))
    top_negative_words.sort(key=lambda x: x[1], reverse=True)
    top_negative_words = top_negative_words[:5]
    top_negative_words_by_subject[subject] = top_negative_words

for subject, words in top_negative_words_by_subject.items():
    print(f"Subject: {subject}")
    for word, percentage in words:
        print(f"Word: {word}, fake %: {percentage:.2f}%")
    print("\n")

Subject: News
Word: image, fake %: 99.30%
Word: obama, fake %: 98.54%
Word: time, fake %: 98.40%
Word: people, fake %: 91.30%
Word: clinton, fake %: 87.55%


Subject: politics
Word: hillary, fake %: 98.81%
Word: obama, fake %: 98.54%
Word: time, fake %: 98.40%
Word: news, fake %: 97.27%
Word: people, fake %: 91.30%


Subject: Government News
Word: obama, fake %: 98.54%
Word: people, fake %: 91.30%
Word: clinton, fake %: 87.55%
Word: department, fake %: 85.85%
Word: trump, fake %: 84.06%


Subject: left-news
Word: hillary, fake %: 98.81%
Word: obama, fake %: 98.54%
Word: black, fake %: 97.36%
Word: news, fake %: 97.27%
Word: people, fake %: 91.30%


Subject: US_News
Word: wire, fake %: 99.18%
Word: century, fake %: 99.15%
Word: news, fake %: 97.27%
Word: syria, fake %: 91.77%
Word: media, fake %: 88.79%


Subject: Middle-east
Word: wire, fake %: 99.18%
Word: century, fake %: 99.15%
Word: news, fake %: 97.27%
Word: syria, fake %: 91.77%
Word: media, fake %: 88.79%




### Titles Analysis

Can we spot patterns in Fake News titles?

In [42]:
fake_title_cleaned=data_cleaner(df_fake['title'])

In [43]:
with open('fake_title_cleaned.pkl', 'wb') as f:
    pickle.dump(fake_title_cleaned, f)

In [44]:
with open('fake_title_cleaned.pkl', 'rb') as f:
    fake_title_cleaned = pickle.load(f)

In [45]:
title_words = Counter(word for sublist in sent_to_words(fake_title_cleaned) for word in sublist)
top_words = title_words.most_common(10)
top_words

[('video', 8233),
 ('trump', 7907),
 ('obama', 2542),
 ('hillary', 2270),
 ('clinton', 1118),
 ('president', 1113),
 ('black', 876),
 ('news', 871),
 ('white', 853),
 ('donald', 818)]

In [46]:
title_vectorized,vectorized=bow_tfidf(fake_title_cleaned, None)
title_vectorized

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [47]:
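# Rank the words of the first title (row 0) by TF-IDF weight, highest first.
# Sorting only that row avoids argsorting the whole matrix.
top_indices = np.argsort(title_vectorized[0])[::-1][:10]
feature_names = vectorized.get_feature_names_out()
top_words_with_tfidf = [(feature_names[index], title_vectorized[0, index]) for index in top_indices]
top_words_with_tfidf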

[('embarrassing', 0.5787271209859361),
 ('message', 0.5002059241600817),
 ('year', 0.49304449745847057),
 ('donald', 0.37237534192879784),
 ('trump', 0.18196890151842793),
 ('fetus', 0.0),
 ('fetal', 0.0),
 ('festivals', 0.0),
 ('festival', 0.0),
 ('fertile', 0.0)]