### Imports

In [345]:
import pandas as pd
import altair as alt
import numpy as np

# Statement of Purpose / General Information

The purpose of this Jupyter notebook is to inspect and visualize two datasets, one of factual news articles and one of misinformative ones. The subject matter of the datasets varies but generally revolves around ex-president Donald Trump and his shenanigans; all articles whose titles do not contain the word "Trump" are removed from the data before visualization. 

There are two important details about the dataset (source: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset) that should be discussed prior to any investigation of the data. 

> Firstly, the dataset is a somewhat-credible source of information. It does not disclose the source of the articles, nor does it mention the criteria used to classify misinformative articles. The dataset's credibility is partially redeemed in that it is held in high regard by the Kaggle community (1000+ upvotes and 46 thousand downloads). 

> Secondly, almost all of the "true" dataset (not misinformative) consists of articles written by Reuters. Independent fact-checkers rate Reuters as "least biased" and consider its factual reporting to be "very high" (source: https://mediabiasfactcheck.com/reuters/), but the dataset is likely not wholly representative of news on Donald Trump regardless. 

### Import Dataset

In [346]:
fake_dataset = pd.read_csv("/Users/isaacheitmann/Downloads/News _dataset/Fake.csv")
true_dataset = pd.read_csv("/Users/isaacheitmann/Downloads/News _dataset/True.csv")

In [347]:
# inspect
fake_dataset

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"
...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016"
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016"
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016"
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016"


In [348]:
true_dataset

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"
...,...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017"
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017"
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017"
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017"


In [349]:
# what articles are not from Reuters? 
true_dataset[true_dataset["text"].apply(lambda x: "Reuters" not in x)]

Unnamed: 0,title,text,subject,date
103,Democratic U.S. senator seeks audit of EPA chi...,WASHINGTON () - The top Democrat on the Senate...,politicsNews,"December 18, 2017"
427,Factbox: Republicans to watch in U.S. Senate t...,WASHINGTON - Some key U.S. senators still had ...,politicsNews,"November 30, 2017"
1141,GAO opens door for Congress to review leverage...,NEW YORK (IFR) - The investigative arm of Cong...,politicsNews,"October 19, 2017"
3488,White House unveils list of ex-lobbyists grant...,The White House on Wednesday disclosed a group...,politicsNews,"June 1, 2017"
4358,Factbox: Trump Supreme Court appointee to affe...,"Neil Gorsuch, President Donald Trump’s appoint...",politicsNews,"April 7, 2017"
5363,Trump's defense chief visits UAE in first Midd...,ABU DHABI - U.S. President Donald Trump’s defe...,politicsNews,"February 18, 2017"
5784,Trump Supreme Court nominee Gorsuch seen in th...,"Federal appeals court judge Neil Gorsuch, the ...",politicsNews,"February 1, 2017"
5821,Kushner divests equity in major NYC property,NEW YORK (IFR) - Jared Kushner has divested hi...,politicsNews,"January 31, 2017"
6823,Commentary: Trump can't fight Islamic State wi...,Over the course of the U.S. presidential campa...,politicsNews,"December 7, 2016"
7365,Tough reality check for Trump's pledge of bett...,"CHARLOTTE, North Carolina - Donald Trump’s pro...",politicsNews,"November 10, 2016"
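
To put a number on "almost all", one could compute the share of rows whose text mentions Reuters. A minimal sketch on a made-up three-row frame (the sample rows are hypothetical stand-ins, not taken from True.csv):

```python
import pandas as pd

# Hypothetical stand-in for true_dataset; the real frame comes from True.csv
sample = pd.DataFrame({
    "text": [
        "WASHINGTON (Reuters) - First example article.",
        "NEW YORK (IFR) - Second example article.",
        "LONDON (Reuters) - Third example article.",
    ]
})

# Vectorized substring test; regex=False treats "Reuters" as a literal string
mentions_reuters = sample["text"].str.contains("Reuters", regex=False)

reuters_share = mentions_reuters.mean()   # fraction of rows mentioning Reuters
non_reuters = sample[~mentions_reuters]   # rows that do not mention Reuters
print(f"{reuters_share:.1%} of sample rows mention Reuters")
```

The boolean mask does the same job as the `apply(lambda ...)` filter above, and its mean gives the proportion directly.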


### Very briefly, take a look at an article or two from each group (misinfo / true) 

In [350]:
# fake
fake_dataset.iat[0,1]

'Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year,  President Angry Pants tweeted.  2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America!  Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t ev

In [351]:
fake_dataset.iat[1,1]

'House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under the assumption, like many of us, that the Christopher Steele-dossier was what prompted the Russia investigation so he s been lashing out at the Department of Justice and the FBI in order to protect Trump. As it happens, the dossier is not what started the investigation, according to documents obtained by the New York Times.Former Trump campaign adviser George Papadopoulos was drunk in a wine bar when he revealed knowledge of Russian opposition research on Hillary Clinton.On top of that, Papadopoulos wasn t just a covfefe boy for Trump, as his administration has alleged. He had a much larger role, but none so damning as being a drunken fool in a wine bar. Coffee boys  don t help to arrange a New York meeting between Trump and President Abdel Fattah el-Sisi of Egypt two months before the election. It was known before that the former aide set up meetings with world leaders for Trump, but team Tr

In [352]:
# true
true_dataset.iat[0,1]

'WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support educat

In [353]:
true_dataset.iat[1,1]

'WASHINGTON (Reuters) - Transgender people will be allowed for the first time to enlist in the U.S. military starting on Monday as ordered by federal courts, the Pentagon said on Friday, after President Donald Trump’s administration decided not to appeal rulings that blocked his transgender ban. Two federal appeals courts, one in Washington and one in Virginia, last week rejected the administration’s request to put on hold orders by lower court judges requiring the military to begin accepting transgender recruits on Jan. 1. A Justice Department official said the administration will not challenge those rulings. “The Department of Defense has announced that it will be releasing an independent study of these issues in the coming weeks. So rather than litigate this interim appeal before that occurs, the administration has decided to wait for DOD’s study and will continue to defend the president’s lawful authority in District Court in the meantime,” the official said, speaking on condition 

### Preparing the Data

In [427]:
# modify the dataset 
l = [] 
l2 = []
for row in fake_dataset["title"]:
    if "Trump" in row: # ensure that Donald Trump is discussed in the article
        l.append(row.split(" "))

for row in l: 
    for word in row:
        l2.append(word)
        
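# Series.append was removed in pandas 2.0, so combine the slices with pd.concat
counts = pd.Series(l2).value_counts()
topten = pd.concat([counts[:2], counts[3:10]])  # skips the entry at position 2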
topten

Trump      5797
           5074
For        1541
The        1539
Trump’s    1321
(VIDEO)    1210
Of         1140
In         1124
A          1083
dtype: int64
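
The blank entry with 5,074 occurrences is most likely an artifact of `split(" ")`, which emits an empty string for every leading, trailing, or doubled space. A minimal sketch, on hypothetical titles, of the same count using `str.split()` with no argument (which discards empty tokens) together with `explode`:

```python
import pandas as pd

# Hypothetical titles, not rows from Fake.csv
titles = pd.Series([
    " Trump Sends Out A Tweet",   # leading space
    "Trump  Is Back",             # doubled space
    "Pope Francis Speaks",        # no "Trump", so it is filtered out
])

trump_titles = titles[titles.str.contains("Trump", regex=False)]

# split(" ") keeps empty strings around extra spaces
naive_tokens = [w for t in trump_titles for w in t.split(" ")]

# str.split() with no argument splits on runs of whitespace
clean_counts = trump_titles.str.split().explode().value_counts()
print(clean_counts.head())
```

With this approach the empty token never enters the counts, so it does not need to be removed by hand later.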

An entire visualization showing how common words like "For", "The", and "Of" are would not be very interesting. We can use NLTK (the Natural Language Toolkit) to remove these "boring" words and see what else there is. 

In [452]:
# get NLTK items
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

NLTK includes a useful resource called "stopwords": a list of the "boring" words mentioned earlier (it may need to be fetched once with `nltk.download('stopwords')`). We can remove these words from our dataset with the help of the "stopwords" set. 

In [453]:
# Set our stopwords
stopwords=set(stopwords.words('english'))
stopwords

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [525]:
# modify the dataset again
l = [] 
l2 = []
for row in fake_dataset["title"]:
    if "Trump" in row:
        l.append(row.split(" "))

for row in l: 
    for word in row: 
        if word not in stopwords:
            l2.append(word)
        
# Series.append was removed in pandas 2.0, so combine the slices with pd.concat
fcounts_all = pd.Series(l2).value_counts()
ftopten = pd.concat([fcounts_all[:2], fcounts_all[3:10]])
ftopten


Trump      5797
           5074
For        1541
The        1539
Trump’s    1321
(VIDEO)    1210
Of         1140
In         1124
A          1083
dtype: int64
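
These counts are identical to the earlier ones, which suggests the filter had no effect on the top entries: NLTK's English stopwords are all lowercase, while the title words are capitalized ("For", "The", "Of"). A minimal sketch of a case-insensitive check; the small `demo_stopwords` set and the word list are hypothetical stand-ins so the example runs without downloading the NLTK corpus:

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english'))
demo_stopwords = {"for", "the", "of", "in", "a", "to"}

# Hypothetical tokens, including an empty string from split(" ")
words = ["Trump", "For", "The", "Trump", "Of", "GOP", "In", "A", ""]

# Lowercase each word before the membership test; also drop empty tokens
kept = [w for w in words if w and w.lower() not in demo_stopwords]

print(pd.Series(kept).value_counts())
```

Lowercasing only for the comparison keeps the original capitalization in the output, so "Trump" and "GOP" still display as written.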

In [526]:
# remove more boring words manually
stopwords2 = set.union(stopwords, {'', "The", "would","says","Says", "said", "like", "I", "one", "It", "also", "He", 
                                  "even", "We", "get", "This", "could", "In", "going", "via", "To",
                                 "Of", "A", "On", "[Video]", "(VIDEO)", "[VIDEO]", "And", "For", "With",
                                 "About", "After", "Just", "Is", "At", "By", "From", "His", "Who", "Over",
                                 "Her", "That", "Him", "As", "Are", "Be", "Will", "THE", "TO", "OF", "Out",
                                 "Was", "Has", "Trump,", "Trump’s", topten.index[7]})
# Note: I removed some words like "Trump’s" because it is essentially a duplicate of "Trump" and doesn't really present more data. 
l = [] 
l2 = []
for row in fake_dataset["title"]:
    if "Trump" in row:
        l.append(row.split(" "))

for row in l: 
    for word in row: 
        if word not in stopwords2:
            l2.append(word)
        
# [:2] followed by [2:10] is simply the first ten entries (Series.append was removed in pandas 2.0)
ftopten = pd.Series(l2).value_counts()[:10]
ftopten

Trump        5797
Donald        714
WATCH:        649
President     525
Obama         319
(TWEETS)      307
Hillary       275
Gets          266
GOP           251
New           230
dtype: int64

In [527]:
# true dataset
l = [] 
l2 = []
for row in true_dataset["title"]:
    if "Trump" in row:
        l.append(row.split(" "))

for row in l: 
    for word in row: 
        if word not in stopwords2 and word != "Trump's":
            l2.append(word)
        
# [:2] followed by [2:10] is simply the first ten entries (Series.append was removed in pandas 2.0)
ttopten = pd.Series(l2).value_counts()[:10]
ttopten

Trump         4399
U.S.           844
House          382
White          298
North          211
Factbox:       207
Russia         204
Republican     203
Clinton        185
tax            173
dtype: int64

In [528]:
# clean the false data a little more
fcounts = pd.Series(ftopten.values)
fwords = pd.Series(ftopten.index)
place = pd.Series([1,2,3,4,5,6,7,8,9,10])
ftt = fwords.apply(lambda x: "Word: " + x) + fcounts.apply(lambda x: " | Count: " + str(x)) #"tt" means "tooltip"
fnewtopten = pd.concat([fcounts, fwords, place, ftt], axis=1)

tcounts = pd.Series(ttopten.values)
twords = pd.Series(ttopten.index)
ttt = twords.apply(lambda x: "Word: " + x) + tcounts.apply(lambda x: " | Count: " + str(x))
tnewtopten = pd.concat([tcounts, twords, place, ttt], axis=1)


In [529]:
# concatenate into one dataset
identity = pd.Series(np.linspace(0,1,20)).apply(round).apply(lambda x: "Untruthful" if x == 1 else "Truthful") # misinformative or true
tnewtopten = pd.concat([tnewtopten, identity[:10]], axis=1)

tnewtopten.index = np.linspace(0,10,10)
tnewtopten
fnewtopten = pd.concat([fnewtopten, pd.Series(identity[10:].values)], axis=1)

final_df = pd.concat([tnewtopten, fnewtopten], axis=0)
final_df.columns = ["Count", "Word", "Place", "Tooltip", "Article Identity"]


In [530]:
final_df

Unnamed: 0,Count,Word,Place,Tooltip,Article Identity
0.0,4399,Trump,1,Word: Trump | Count: 4399,Truthful
1.111111,844,U.S.,2,Word: U.S. | Count: 844,Truthful
2.222222,382,House,3,Word: House | Count: 382,Truthful
3.333333,298,White,4,Word: White | Count: 298,Truthful
4.444444,211,North,5,Word: North | Count: 211,Truthful
5.555556,207,Factbox:,6,Word: Factbox: | Count: 207,Truthful
6.666667,204,Russia,7,Word: Russia | Count: 204,Truthful
7.777778,203,Republican,8,Word: Republican | Count: 203,Truthful
8.888889,185,Clinton,9,Word: Clinton | Count: 185,Truthful
10.0,173,tax,10,Word: tax | Count: 173,Truthful
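
The same long-form table can be assembled with less index juggling. A sketch under the assumption that `ttopten` and `ftopten` are Series of counts indexed by word; the short Series below are abbreviated stand-ins built from the counts shown earlier, and the helper name `to_long` is hypothetical:

```python
import pandas as pd

# Abbreviated stand-ins for ttopten / ftopten (first three entries of each)
ttopten_demo = pd.Series({"Trump": 4399, "U.S.": 844, "House": 382})
ftopten_demo = pd.Series({"Trump": 5797, "Donald": 714, "WATCH:": 649})

def to_long(counts: pd.Series, identity: str) -> pd.DataFrame:
    """Turn a word-count Series into Count, Word, Place, Tooltip, identity rows."""
    df = pd.DataFrame({"Count": counts.values, "Word": counts.index})
    df["Place"] = range(1, len(df) + 1)  # rank, assuming counts are sorted descending
    df["Tooltip"] = "Word: " + df["Word"] + " | Count: " + df["Count"].astype(str)
    df["Article Identity"] = identity
    return df

final_demo = pd.concat(
    [to_long(ttopten_demo, "Truthful"), to_long(ftopten_demo, "Untruthful")],
    ignore_index=True,  # avoids duplicate index labels after stacking
)
print(final_demo)
```

Using `ignore_index=True` gives the stacked frame a clean 0..n-1 index, so there is no need to assign a `np.linspace` index to one half first.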


In [559]:
# Time to plot! code source / inspiration: https://uwdata.github.io/visualization-curriculum/altair_marks_encoding.html
alt.Chart(
    final_df
).mark_point(
    filled=True,
    opacity = 0.7
).encode(
    alt.X('Place:Q', scale=alt.Scale(padding=10), axis=alt.Axis(values=[1,2,3,4,5,6,7,8,9,10])),
    alt.Size("Count:Q"),
    alt.Tooltip("Tooltip"),
    alt.Column("Article Identity:N"),
    alt.Color("Article Identity:N")
).properties(
    width = 400,
    height = 200,
    title="Num. Occurrences of Common Words in Articles Discussing Trump"
).configure(
    font = 'times new roman',
    autosize = "pad"
).configure_axis(
    grid=False,
    tickSize = 0
).configure_legend(
    strokeColor = "gray",
    padding=7.5,
    cornerRadius=10
).configure_title(
    fontSize=20
)




# Justification for Visualization

In my _authentic_ visualization, I use two separate graphs to display the ten most common words in the titles of truthful and misinformative (untruthful) articles discussing Donald Trump, along with the number of times each word occurs. I believe that the visualization depicts the nuanced nature of misinformation, as there is nothing particular about any of the words included in the visualization that is "misinformative" or "truthful"; I could switch the labels on the dataset (label the misinformative words as truthful and vice versa) and my visualization would seem equally valid. 

__Color__
> In my visualization, the truthful words are represented in blue and the misinformative / untruthful ones are shown in orange. Blue and orange were the colors that Altair automatically used, and I left them unchanged because I felt that they perfectly illustrated the contrast between truth and misinformation. Blue is a color that represents safety (think: blue sky), while orange alludes to a medium level of danger (not quite as severe as red, but it conveys a sense of looming doom). 

__Tooltips__
> I think that the tooltips, courtesy of Altair, were my favorite detail of this project (hover your mouse over each of the circles in the visualization to see what word is represented and how many times it appears in a dataset). Because some of the words in my dataset appear thousands of times, using only the radius of each circle to represent the number of occurrences would not be very detailed (nobody would be able to tell from the radius alone exactly how many times a word appears). Further, the idea of putting each word on the x-axis so that each circle in the visualization would have an identity attached felt clunky to me. Using the tooltip was therefore a way for me to display my data concisely and cleanly. 

__Place__ 
> Each circle in the visualization sits above a number on the x-axis that indicates its rank in the list of most common words for either the truthful or misinformative / untruthful articles (e.g., "Trump", with rank 1, is the most common word in both misinformative and truthful articles). 

__Size of Circles__
> Simply put, the size of each circle in the visualization indicates the number of times the word it represents occurs in a dataset; it seemed logical to me that more common words be represented by larger marks. However, the tooltips, which give an actual count of each word's occurrences, are a much more precise representation of word frequency than circle size.

__Legend__
> The legend, located towards the right of the visualization, explains my usage of color and size (see above). Although some of the information in the legend can be inferred from the visualization, I thought it important to include the legend just to be thorough and eliminate any possible viewer confusion.  

__Mistakes__
> Unfortunately, I could not figure out (even after hours of scouring the web) how to make the circles in my visualization larger. This is likely a result of the circle sizes being automatically set by Altair based on the number of word occurrences.

# Help / Sources Consulted

changing graph padding: https://www.markhneedham.com/blog/2021/04/02/altair-discrete-x-axis-margin-padding/

general graph configuration: https://altair-viz.github.io/user_guide/configuration.html

Altair documentation as a whole:  https://altair-viz.github.io/index.html