# Topic modeling via BERTopic

In this notebook, we use BERTopic to look for differences between the topics covered in fake and real news articles.

# Read in datasets

In [3]:
from bertopic import BERTopic
import pandas as pd
import numpy as np

# Read in from the Kaggle training dataset.
df = pd.read_csv('../Datasets/fake-news/train.csv', usecols = ['id','title','text','label'])

# Split the training set (df) into real and fake news (this notebook treats label 1 as real and label 0 as fake)
df_t = df[df['label'] == 1]
df_f = df[df['label'] == 0]

# Drop rows containing NaN in either title or text columns
df_true = df_t.dropna(subset=['title','text'])
df_fake = df_f.dropna(subset=['title','text'])

# Further split into titles and texts
title_true = df_true.loc[:,'title'].tolist()
text_true = df_true.loc[:,'text'].tolist()
title_fake = df_fake.loc[:,'title'].tolist()
text_fake = df_fake.loc[:,'text'].tolist()

# Compare true/fake news with and without NaN
print("True news: ", df_t.shape[0], "; True news without NaN: ", df_true.shape[0], \
      "\nFake news: ", df_f.shape[0], "; Fake news without NaN: ", df_fake.shape[0])

True news:  10413 ; True news without NaN:  9816 
Fake news:  10387 ; Fake news without NaN:  10387


# BERTopic modeling on titles

We now cluster topics by titles only. In the run shown below, 3372 of the true news titles and 3755 of the fake news titles are outliers, roughly a third of each set. Here, an "outlier" is a title that the default BERTopic clustering algorithm does not assign to any topic (it is reported as Topic -1). Because the clustering is stochastic, the clusters and the outlier counts may differ with each run.

In [4]:
title_true_topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
title_true_topics, title_true_probs = title_true_topic_model.fit_transform(title_true)

In [5]:
title_fake_topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
title_fake_topics, title_fake_probs = title_fake_topic_model.fit_transform(title_fake)

In [6]:
title_true_topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,3372,-1_hillary_the_trump_to
1,0,225,0_fraud_rigged_voting_voter
2,1,218,1_muslim_migrant_muslims_refugee
3,2,201,2_putin_russia_russian_ukraine
4,3,182,3_michelle_hillary_her_clinton
...,...,...,...
172,171,11,171_laugh_conway_scolds_enablers
173,172,11,172_iran_deal_deals_nuclear
174,173,11,173_cannabis_thc_hemp_traumatic
175,174,10,174_terrorists_strikes_ngos_russian


In [7]:
title_fake_topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,3755,-1_the_new_times_york
1,0,268,0_news_breitbart_cnn_fake
2,1,173,1_women_fashion_beyoncé_female
3,2,166,2_obamacare_paul_ryan_care
4,3,147,3_cartel_mexican_border_texas
...,...,...,...
171,170,11,170_colombia_farc_peace_mudslide
172,171,11,171_brazil_brazilian_corruption_prone
173,172,10,172_christie_bridge_closings_culprit
174,173,10,173_millennials_americans_entrepreneurs_adulthood
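
To put the outlier counts in proportion, the share of documents assigned to Topic -1 can be computed directly from the list of topic ids that `fit_transform` returns. The helper below is a minimal sketch; the name `outlier_share` and the toy list are hypothetical and not part of the notebook.

```python
from collections import Counter

def outlier_share(topics):
    """Return (outlier_count, fraction) for a list of topic ids.

    Assumes the BERTopic convention that id -1 marks documents
    left outside every cluster.
    """
    counts = Counter(topics)
    n_outliers = counts.get(-1, 0)
    total = len(topics)
    return n_outliers, (n_outliers / total if total else 0.0)

# Toy example with made-up topic ids (not the notebook's data)
toy_topics = [-1, 0, 0, 1, -1, 2, 1, 0]
print(outlier_share(toy_topics))  # (2, 0.25)

# Shares implied by the Topic -1 counts in the two tables above
print(round(100 * 3372 / 9816, 2))   # 34.35 (true news titles)
print(round(100 * 3755 / 10387, 2))  # 36.15 (fake news titles)
```

In the notebook this would be called as `outlier_share(title_true_topics)`. If stable counts across runs matter, BERTopic accepts a `umap_model` argument, and passing a UMAP instance with a fixed `random_state` is the usual way to make the clustering repeatable.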


## Exploring the titles in true news

In the true news data, Topics 5, 6, and 11 single out Russian, Spanish, and German articles, respectively.

In [60]:
title_true_topic_model.visualize_barchart(width=200, height=200, top_n_topics=23, n_words=10,title="Title Clusters in True News")

Other topics are thematic. Topic 22, for instance, is related to healthcare.

In [10]:
# Pair each title with its assigned topic, then keep topic 22 only:
topic_true_df = pd.DataFrame({"topic": title_true_topics, "document": title_true})
topic_true = topic_true_df[topic_true_df.topic == 22]
for i in range(16):
    print(topic_true['document'].values[i])

Insurance Prices for Many Obamacare Customers Will Rise By Double Digits in 2017
British Healthcare Offers a Glimpse into the Future of Obamacare
ObamaCare: Things Fall Apart
Obamacare “Near Collapse” in Minnesota as Prices Jump 60% Average
U.S. Health Care Lags Behind, Costs More Than Other Countries
LOL! Obama’s latest ACA train wreck scapegoat is nothing short of laughable
Because of Hillary Clinton, Emergency-Contraception Is Banned In Honduras
Planned Parenthood prepares to fight: VICE News Tonight on HBO (Full Segment) – The Rundown Live
Half of ObamaCare Enrollees Avoid Doctors' Visits to Save Healthcare Costs
Re: WOW! What Josh Earnest admitted about Obamacare is stunning (because it’s true)
BRUTAL! This map shows Obamacare premiums going up as much as 116% in some states
After a Century, Planned Parenthood Needs to be Shut Down
The Real Reason Obamacare is Coming Unglued
INSANITY: Watch O-care architect Jonathan Gruber explain the cost of ‘freedom’
Four-Day Obama Trip in 2013 
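
The filter-and-print pattern used above recurs throughout this notebook, so it can be sketched as a small reusable helper. This is an illustrative sketch only: `documents_in_topic` is a hypothetical name, and it works on plain lists such as `title_true_topics` and `title_true` rather than on a data frame.

```python
def documents_in_topic(topics, documents, topic_id, limit=None):
    """Return the documents assigned to `topic_id`, in original order.

    `topics` and `documents` are parallel lists, as returned by and
    passed to `fit_transform`. `limit` caps the number of results.
    """
    if len(topics) != len(documents):
        raise ValueError("topics and documents must have the same length")
    selected = [doc for t, doc in zip(topics, documents) if t == topic_id]
    return selected if limit is None else selected[:limit]

# Toy example with made-up topic ids
toy_topics = [22, 0, 22, -1]
toy_titles = [
    "ObamaCare: Things Fall Apart",
    "Breitbart News Daily: Trump Boom - Breitbart",
    "The Real Reason Obamacare is Coming Unglued",
    "Four-Day Obama Trip in 2013",
]
for title in documents_in_topic(toy_topics, toy_titles, 22, limit=16):
    print(title)
```

Slicing with `limit` returns fewer items when a topic is small, whereas indexing a fixed `range(16)` raises an `IndexError` if the topic holds fewer than 16 documents.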

This once again shows how noisy our dataset is: id=2519, labeled as true news, has a title taken from an opinion piece on [Twitchy](https://twitchy.com/dougp-3137/2016/10/27/lol-obamas-latest-aca-train-wreck-scapegoat-is-nothing-short-of-laughable/).
The text, a series of tweets, confirms that this is an opinion piece:

In [133]:
obama = df_true[df_true['title'].str.contains('LOL! Obama', regex=False)]
obama

Unnamed: 0,id,title,text,label
2519,2519,LOL! Obama’s latest ACA train wreck scapegoat ...,"— Derek Hunter (@derekahunter) October 28, 201...",1


In [149]:
df_true[df_true['title'].str.contains('podcast', regex=True)]

Unnamed: 0,id,title,text,label


In [153]:
print(obama.iloc[0,2])

— Derek Hunter (@derekahunter) October 28, 2016 
Obama’s legacy will be to leave everybody laughing for all the wrong reasons. His lap dogs? Cute. https://t.co/jijTNij1M7 
— Hans…boobie… (@deanriehm) October 28, 2016 This is beyond parody. The press had to be shamed into even acknowledging Gruber. https://t.co/GJeyQz2Pvh 
— Jason C. (@CounterMoonbat) October 28, 2016 Perfect.Nothing is ever his fault.Obamacare, Syria, ISIS, Crimea, Ukraine, $19 trillion. @dcexaminer https://t.co/E0fRIXK1N2 
— The 57th State ℅EF™ (@EF517_V2) October 28, 2016 Wasn't it just last week that he said a president can't whine and blame others? #BiggestWhinerAndBlamerEver https://t.co/ZIQIel4aeP 
— Lee Ritz, MD (@lee_ritz) October 28, 2016 Yes, the press made Obamacare fail. 
Or…… maybe it was the horrific plan itself that failed on its own. https://t.co/a4qXISwjtR 
— Paul Crisp (@pcrispy) October 28, 2016 You'll find this in the dictionary under "Bite the hand that feeds you" https://t.co/lsherhCCWF 
—

The article on the website begins as follows (commentary interleaved with tweets); the excerpt here is cut off partway through:
    We know that President Obama and the Democrats have already blasted Republicans for not helping “fix” the unfolding disaster that is Obamacare, and the White House has set its sights on somebody else to blame:

    Obama blames the press for #Obamacare trouble https://t.co/CyZsiRZMDS pic.twitter.com/ZL21B6sLT3

    — Washington Examiner (@dcexaminer) October 27, 2016

WOW, he’s really running out of fingers to point.

Washington Examiner:

    President Obama told activists they need to fight through a wave of negative press stories about Obamacare this year to ensure enrollment numbers go up, and said plans are still affordable despite stories saying premiums will rise sharply in 2017.

    “We’re not going to get that much help from the media,” Obama told the more than 25,000 volunteers who joined a White House call with Obama Thursday afternoon. “This is going to be a ground game.”

Ha! Sure, because the mainstream media’s always been working against the Obama administration, right!?

    The press is sending those premium hike letters, canceling people's plans, and forcing insurers out of the exchanges?? Wow! ? #whining https://t.co/0b2ffrdX7I

    — Guy Benson (@guypbenson) October 28, 2016

    Oh. I guess the press wrote a dumpster fire law, passed it with trickery in the middle of night, after lying about it. Interesting take. https://t.co/kvVjLaIKZT

    — Heather (@hboulware) October 28, 2016

    Yesterday it was Republicans' fault, now it's the media. Total number of media and Republicans who voted for/had any input in Obamacare = 0. https://t.co/rGjeM95bL4

    — Derek Hunter (@derekahunter) October 28, 2016

## Exploring the titles in fake news

In [61]:
title_fake_topic_model.visualize_barchart(width=200, height=200, top_n_topics=20, n_words=10,title="Title Clusters in Fake News")

Topic 0 is dominated by Breitbart. It turns out that titles in this dataset sometimes include the name of the news source.

In [14]:
# Pair each title with its assigned topic, then keep topic 0 only:
topic_fake_df = pd.DataFrame({"topic": title_fake_topics, "document": title_fake})
topic_fake = topic_fake_df[topic_fake_df.topic == 0]
for i in range(50):
    print(topic_fake['document'].values[i])

Chuck Todd to BuzzFeed EIC: ’You Just Published Fake News’ - Breitbart
Breitbart News Daily: Trump Boom - Breitbart
TV Anchors Arrive at the White House for Lunch with Donald Trump - Breitbart
Pelosi: Republicans Should Tell Trump He’s ’Bringing Dishonor’ to the Presidency - Breitbart
CNN Statement Distances Network from Buzzfeed Fake News Dossier - Breitbart
Pew: American Trust Level in Federal Government Plummets to Historic Lows - Breitbart
Al Sharpton to Dems: No Point Appealing to ‘Archie Bunker’ Trump Voters - Breitbart
Atlantic’s Goldberg: I’m ‘Not Confident’ Trump Can Handle ‘Matters of Life and Death’ - Breitbart
The Vanquished to Witness the Takeover: Bushes, Clintons Will Attend Donald Trump’s Inauguration - Breitbart
Trump at Inaugural Balls: ’Now the Work Begins ... We Are Not Playing Games’ - Breitbart
Media Outrage over White House ’Exclusion’ is Fake News - Breitbart
There’s Only One Trump Administration Position That’s Gaining Popularity And It’s Going To Shock You - B
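
Since some titles end with the name of their source, one rough way to see which sources appear is to split each title on its last " - " and count the suffixes. The sketch below is an assumption-laden heuristic, not the notebook's method: `trailing_sources`, the separator, and the `min_count` threshold are all hypothetical choices, and a title that merely contains " - " will be miscounted unless the threshold filters it out.

```python
from collections import Counter

def trailing_sources(titles, separator=" - ", min_count=2):
    """Count the text after the last `separator` in each title.

    Suffixes seen fewer than `min_count` times are dropped, since a
    one-off suffix is more likely part of the headline than a source.
    """
    counts = Counter()
    for title in titles:
        if isinstance(title, str) and separator in title:
            counts[title.rsplit(separator, 1)[1].strip()] += 1
    return {src: n for src, n in counts.items() if n >= min_count}

# Titles copied from the outputs in this notebook
toy_titles = [
    "Breitbart News Daily: Trump Boom - Breitbart",
    "Pew: American Trust Level in Federal Government Plummets to Historic Lows - Breitbart",
    "Wells Fargo Chief Abruptly Steps Down - The New York Times",
    "How Hillary Clinton Became a Hawk - The New York Times",
    "ObamaCare: Things Fall Apart",
]
print(trailing_sources(toy_titles))
# {'Breitbart': 2, 'The New York Times': 2}
```

Applied to `title_fake` or `title_true`, the result would give a quick list of candidate source names to check with `str.contains`.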

## Exploring article texts in true news

In [16]:
text_true_topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
text_true_topics, text_true_probs = text_true_topic_model.fit_transform(text_true)

In [17]:
text_fake_topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
text_fake_topics, text_fake_probs = text_fake_topic_model.fit_transform(text_fake)

In [18]:
text_true_topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,3272,-1_the_of_and_to
1,0,196,0_pipeline_dakota_standing_rock
2,1,182,1_comey_fbi_investigation_director
3,2,170,2_you_your_we_our
4,3,161,3_aleppo_syrian_syria_al
...,...,...,...
183,182,10,182_mi5_parker_guardian_russia
184,183,10,183_intelligence_attacks_qaeda_threat
185,184,10,184_black_blacks_haiti_african
186,185,10,185_rally_hillary_crowd_rallies


In [19]:
text_fake_topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,3700,-1_and_the_in_of
1,0,286,0_health_insurance_obamacare_care
2,1,151,1_syria_syrian_assad_aleppo
3,2,117,2_intelligence_nunes_trump_surveillance
4,3,113,3_police_officers_officer_shooting
...,...,...,...
187,186,11,186_merkel_germany_schulz_chancellor
188,187,11,187_bannon_security_national_council
189,188,10,188_adnani_syria_islamic_qaeda
190,189,10,189_school_students_student_arundel


In [64]:
text_true_topic_model.visualize_barchart(width=200, height=200, top_n_topics=20, n_words=10,title="Text Clusters in True News")

In [29]:
# Filter above data frame by topic 2 only:
topic_true_df = pd.DataFrame({"topic": text_true_topics, "document": text_true, "title": title_true})
topic_true = topic_true_df[topic_true_df.topic == 2]
for i in range(10):
    print(topic_true['title'].values[i], "\n\n", topic_true['document'].values[i], "\n------\n")

Sparking An Inner Revolution 

 0 0 With humanity’s awakening continuing to speed up as every day passes, we see more and more people desiring information to help them better themselves. Whether that is done through eating healthier foods, beginning a yoga or meditation practice, or whether it is done by being more mindful with the words they choose to use when having an inner dialogue or a conversation with friends and family, more people are looking for information to help them transform their lives in a positive way. While there are many methods to improving one’s life, below are some methods that can help a person spark an inner revolution to create positive and uplifting change in his or her world. Grounding/Earthing Grounding, or what is also called Earthing is when a person has bare skin touching the Earth or a tree and is most commonly done by standing on the Earth with one’s bare feet. Doing this begins to balance the electrochemical state of the body because of the negative i

The true news article ["Hidden in plain sight – The global depopulation agenda"](https://themadtruther.com/2016/10/20/hidden-in-plain-sight-the-global-depopulation-agenda/) promotes a conspiracy theory. The [about page](https://themadtruther.com/about/) of the source website states that the author is interested in UFOs and wants to "wake people up". Its top 5 posts include:
- Video: Pfizer’s “Secret” Report on the Covid Vaccine. Beyond Manslaughter. The Evidence is Overwhelming.
- How Are Hunter Biden, Klaus Schwab & CIA Connected To US Biolabs In Ukraine?

## Exploring article texts in fake news

In [66]:
text_fake_topic_model.visualize_barchart(width=200, height=200, top_n_topics=20, n_words=10,title="Text Clusters in Fake News")

In [31]:
# Filter above data frame by topic 2 only:
topic_fake_df = pd.DataFrame({"topic": text_fake_topics, "document": text_fake, "title": title_fake})
topic_fake = topic_fake_df[topic_fake_df.topic == 2]
for i in range(10):
    print(topic_fake['title'].values[i], "\n\n", topic_fake['document'].values[i], "\n------\n")

Woodward: Trump Dossier Is a ’Garbage Document’ - Intelligence Chiefs Should ’Apologize’ to Trump - Breitbart 

 On this weekend’s broadcast of “Fox New Sunday,” veteran journalist Bob Woodward said the unverified dossier about   Donald Trump and Russia is a “garbage document. ”  Woodward said, “I think what is under reported here is Trump’s point of view on it. You laid it out when those former CIA people said these things about Trump, that he was a recruited agent of the Russians, and a useful fool, they started this in Trump’s mind, He knows the old adage, once a CIA man, always a CIA man. No one came out and said those people shouldn’t be saying those things, So act two is the briefing when this dossier is put out. ” “I’ve lived in this world for 45 years where you get things and people make allegations, that is a garbage document,” he continued. “It never should have been presented as part of an intelligence briefing. As you suggested, other channels have the White House counsel g

The articles above in Topic 2 draw heavily from Breitbart.

In [32]:
# Filter above data frame by topic 4 only:
topic_fake_df = pd.DataFrame({"topic": text_fake_topics, "document": text_fake, "title": title_fake})
topic_fake = topic_fake_df[topic_fake_df.topic == 4]
for i in range(10):
    print(topic_fake['title'].values[i], "\n\n", topic_fake['document'].values[i], "\n------\n")

After Berkeley, Treat the Violent, Anti-Speech Left Like the KKK 

 The violence that stopped Breitbart tech editor Milo Yiannopoulos from speaking at the University of California, Berkeley this week happened after some incredibly irresponsible statements by university administrators and local politicians. [Berkeley Mayor Jesse Arreguin tweeted that Milo’s “hate speech” was not “welcome in our community. ” UC Berkeley Chancellor Nicholas Dirks said that the university not only opposed Milo’s views, but also his ostensibly harmful presence on campus.  Yet most of the blame must lie squarely with the rioters, who include Black Bloc anarchists and   “Antifa” ( ) activists, who exemplify the very fascism they supposedly want to resist. These groups openly and explicitly declare their intention to disrupt public gatherings where conservatives  —   or, really, anyone they do not like for whatever reason  —   are scheduled to appear. They not only celebrate violence, but they come armed, mask

# Instances of news with Breitbart in title

In [36]:
breitbart = df_true[df_true['title'].str.contains('Breitbart', regex=True)]
breitbart

Unnamed: 0,id,title,text,label
11413,11413,"Samantha Bee: Who Gives A F**K About Trump, Th...","Samantha Bee: Who Gives A F**K About Trump, Th...",1
12106,12106,Statistical Tie: Latest Breitbart/Gravis Poll ...,Statistical Tie: Latest Breitbart/Gravis Poll ...,1
17908,17908,Giuliani Defends Breitbart News Against MSNBC ...,"\r\nWednesday on MSNBC, former New York City M...",1
18119,18119,Donald Trump: Hillary's Syria Policy Would Lea...,Donald Trump: Hillary’s Syria Policy Would Lea...,1


In [44]:
breitbart = df_fake[df_fake['title'].str.contains('Breitbart', regex=True)]
breitbart.shape[0]

2352

In [45]:
breitbart = df_fake[df_fake['title'].str.contains('- Breitbart', regex=True)]
breitbart.shape[0]

2338

Of the 10387 fake news articles, between 2338 and 2352 (about 22.5-22.6%) explicitly name Breitbart in the title.
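
The percentages above follow from dividing each match count by the number of fake news rows. As a minimal sketch, the same count-and-share calculation can be wrapped in a helper that works on a plain list of titles; `share_containing` is a hypothetical name, and it uses plain substring matching, which gives the same result as `str.contains` for patterns without regex metacharacters such as `'Breitbart'` and `'- Breitbart'`.

```python
def share_containing(titles, needle):
    """Return (match_count, fraction) of titles containing `needle`."""
    matches = sum(1 for title in titles if needle in title)
    total = len(titles)
    return matches, (matches / total if total else 0.0)

# Toy example using titles from the outputs above
toy_titles = [
    "Breitbart News Daily: Trump Boom - Breitbart",
    "Wells Fargo Chief Abruptly Steps Down - The New York Times",
    "How Hillary Clinton Became a Hawk - The New York Times",
    "ObamaCare: Things Fall Apart",
]
print(share_containing(toy_titles, "- Breitbart"))  # (1, 0.25)

# Reproducing the range quoted above from the two counts
print(round(100 * 2338 / 10387, 2))  # 22.51
print(round(100 * 2352/ 10387, 2))   # 22.64
```

In the notebook this would be called as `share_containing(title_fake, "- Breitbart")`.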

In [39]:
nyt = df_fake[df_fake['title'].str.contains('NYT', regex=True)]
nyt.shape[0]

24

In [42]:
nyt = df_true[df_true['title'].str.contains('NYT', regex=True)]
nyt.shape[0]

22

In [67]:
npr = df_true[df_true['title'].str.contains('NPR', regex=True)]
npr.shape[0]

2

In [68]:
npr = df_fake[df_fake['title'].str.contains('NPR', regex=True)]
npr.shape[0]

4

In [69]:
nyt = df_true[df_true['title'].str.contains('New York Times', regex=True)]
nyt.shape[0]

13

In [71]:
nyt_fake = df_fake[df_fake['title'].str.contains('New York Times', regex=True)]
nyt_fake.shape[0]

6235

In [72]:
nyt_fake.head()

Unnamed: 0,id,title,text,label
7,7,Benoît Hamon Wins French Socialist Party’s Pre...,"PARIS — France chose an idealistic, traditi...",0
8,8,Excerpts From a Draft Script for Donald Trump’...,Donald J. Trump is scheduled to make a highly ...,0
9,9,"A Back-Channel Plan for Ukraine and Russia, Co...",A week before Michael T. Flynn resigned as nat...,0
15,15,"In Major League Soccer, Argentines Find a Home...",Guillermo Barros Schelotto was not the first A...,0
16,16,Wells Fargo Chief Abruptly Steps Down - The Ne...,The scandal engulfing Wells Fargo toppled its ...,0


In [86]:
for i in range(100):
    print(nyt_fake['title'].iloc[i])

Benoît Hamon Wins French Socialist Party’s Presidential Nomination - The New York Times
Excerpts From a Draft Script for Donald Trump’s Q&ampA With a Black Church’s Pastor - The New York Times
A Back-Channel Plan for Ukraine and Russia, Courtesy of Trump Associates - The New York Times
In Major League Soccer, Argentines Find a Home and Success - The New York Times
Wells Fargo Chief Abruptly Steps Down - The New York Times
Abortion Pill Orders Rise in 7 Latin American Nations on Zika Alert - The New York Times
Andrea Tantaros of Fox News Claims Retaliation for Sex Harassment Complaints - The New York Times
How Hillary Clinton Became a Hawk - The New York Times
Having Won, Boris Johnson and ‘Brexit’ Leaders Fumble - The New York Times
Texas Oil Fields Rebound From Price Lull, but Jobs Are Left Behind - The New York Times
Bayer Deal for Monsanto Follows Agribusiness Trend, Raising Worries for Farmers - The New York Times
Russia Moves to Ban Jehovah’s Witnesses as ‘Extremist’ - The New Yor

In [99]:
other_fake = df_fake[df_fake['title'].str.contains('Reddit', regex=True)]
other_fake.shape[0]

2

In [100]:
other_true = df_true[df_true['title'].str.contains('Reddit', regex=True)]
other_true.shape[0]

3

In [111]:
for i in range(5700,5800):
    print(df_true['title'].iloc[i])

Americans Are So Disconnected from Reality That “Insouciant” Has Become An Euphemism
Watch Dr. Duke’s powerful new television commercial!
The Foul Stench of Fascism in the US    : Information
WHITE FLIGHT? Or is it white fright? British multiculturalism has created segregation in towns where the white population is fleeing as the Muslim population is exploding
Thousands of Wild Buffalo Appear Out of Nowhere at Standing Rock (VIDEO)
Shine Brightly in the Consuming Fire of Divine Love
Anti-War Movement Anticipates More War Under A Clinton Presidency
Tapper Pushes 'Conflict Of Interest' Narrative
Endlich: Esso-Tankstellen bieten jetzt auch veganes Benzin an
Why hydrogen peroxide should be in every home
Statistical Tie: Latest Breitbart/Gravis Poll Shows Donald Trump Closes the Gap with Less Than Two Weeks Left
Desarticulan una red criminal que ofrecía a Arturo Pérez-Reverte para dar palizas
Trump Proudly Declares: Most Of The People I’ve Insulted Deserved It
Riot Police Fire Water Cannon,

True news draws from sources such as National Review, The Onion, Press TV, RedFlag News, the Katehon think tank, and Truthdig.