## Initializing Sources and "Interesting Words"

In [1]:
import mediacloud
import os
import json
import csv
import requests
import collections
from nltk.sentiment.vader import SentimentIntensityAnalyzer



In [2]:
MY_API_KEY = '0b048304d2f7398cb91248b7e07b3b153d32840c1a8c42ab4006f58aaa8a440a'

In [3]:
left_media = ["New York Times", "NPR", "Politico", "CNN", "American Civil Liberties Union"]
left_fake_media = ["If You Only News", "Occupy Democrats", "FWD Now", "Jezebel", "Blue Tribune"]
right_media = ["Fox News", "Daily Telegraph", "Chicago Tribune", "Forbes", "Washington Times"]
right_fake_media = ["Russia Today", "InfoWars", "Natural News", "DCLeaks", "Breitbart"]

## Mapping Media Names to Mediacloud ID's

Media sources in Mediacloud are referenced in API calls through their unique media_id. We get a mapping of media sources to their respective id's by writing them to a csv via the function below.

In [None]:

def write_csv_media(media_name, rows=100):
    """
    Args: media_names: media source name (can have multiple) in list format
          chunk_size: max # of media sources that can be written per iteration
    Write the media_id, url, and names of the given media_name(s) in a CSV file
    """
    media = []
    media_idx = 0
    last_media_id = 0

    fieldnames = [
        u'media_id',
        u'url',
        u'name'
    ]

    while True:
        params = { 'last_media_id': last_media_id, 'rows': rows, 'name': media_name[media_idx], 'key': MY_API_KEY }
        media_list_call = 'https://api.mediacloud.org/api/v2/media/list'
        r = requests.get( media_list_call, params = params, headers = { 'Accept': 'application/json'} )
        data = r.json()
        print ("start:{} num_media_sources:{}".format(last_media_id, len(data)))

        if not len(data):

            # If there are more media_names, write what we have to csv file and continue
            path_name = './csv_storage/media.csv'
            with open( path_name, 'a', newline="") as csvfile:
                print ("\nOpened file: Dumping media source content for {}\n".format(media_name[media_idx]))

                # Flush media buffer to csv file
                cwriter = csv.DictWriter( csvfile, fieldnames, extrasaction='ignore')

                if not os.path.getsize(path_name):
                    cwriter.writeheader()

                cwriter.writerows( media )

            # Continue to next user-inputted media_name
            media_idx += 1
            last_media_id = 0
            media = []
            if media_idx < len(media_name):
                print ("Grabbing sources of next media name:{}\n".format(media_name[media_idx]))
                continue

            # Done if no more media sources to get
            break


        #add to media buffer and search for more media sources similar to current media_name
        media.extend( data )
        last_media_id = media[-1]['media_id']
 

In [12]:
write_csv_media(left_media)
write_csv_media(left_fake_media)
write_csv_media(right_media)
write_csv_media(right_fake_media)

start:0 rows:100
start:651204 rows:100

Opened file: Dumping media source content for New York Times

Grabbing sources of next media name:NPR

start:0 rows:100
start:91179 rows:100
start:139914 rows:100
start:190302 rows:100
start:240503 rows:100
start:299186 rows:100
start:369062 rows:100
start:443490 rows:100
start:508448 rows:100
start:558433 rows:100
start:616090 rows:100
start:705557 rows:100
start:763846 rows:100
start:821041 rows:100
start:885800 rows:100
start:944458 rows:100
start:1011207 rows:100
start:1013589 rows:100

Opened file: Dumping media source content for NPR

Grabbing sources of next media name:Politico

start:0 rows:100
start:345823 rows:100
start:958555 rows:100
start:990826 rows:100

Opened file: Dumping media source content for Politico

Grabbing sources of next media name:CNN

start:0 rows:100
start:268057 rows:100
start:780894 rows:100
start:999468 rows:100

Opened file: Dumping media source content for CNN

Grabbing sources of next media name:American Civil 

In [4]:
#Media_ID's with respect to their names in the above lists. Fill in later.
left_media_dict = {'1': 'New York Times', '1096': 'NPR', '18268': 'Politico', '1095': 'CNN', '27427': 'American Civil Liberties Union'}
left_fake_media_dict = {'206424': 'ifyouonlynews.com', '6154': 'Jezebel'}
right_media_dict = {'1092': 'FOX News', '1750': 'Daily Telegraph', '9': 'Chicago Tribune', '1104': 'Forbes', '101': 'Washington Times'}
right_fake_media_dict = {'305385': 'Russia Today', '18515': 'InfoWars', '24030': 'Natural News', '302379': 'dcleaks.com', '19334': 'Breitbart'}

## Word and Sentence Collecting

Define functions here to conduct some data-processing of words and sentences.

First, we'll grab "word matrices" to get "interesting words" on the article level. Below, we define a function that will take in a media source, and output a dictionary. This dictionary will have specific stories as keys, and the values will be words followed by the number of times they are repeated. 

In [5]:
def get_story_word_matrix(media_id, rows=1000):
    """
    Args: media_id: ID of media source to grab word matrices for. Stories in the returned dicts
    will only be from this media_id.
    """
    params = {'rows': rows, 'q': 'media_id: {}'.format(media_id), 'key': MY_API_KEY}
    word_matrix_call = 'https://api.mediacloud.org/api/v2/stories_public/word_matrix'
    r = requests.get(word_matrix_call, params = params, headers = { 'Accept': 'application/json'} )
    data = r.json()
    return data['word_list'], data['word_matrix']

Next, we create a function that creates a list of these word matrices for each biased set of media sources we defined above.

Also, we will print the word list and matrix out for each media source we go through (**WARNING: Causes my computer to iPython to freeze**).

In [6]:
def get_story_word_matrices_for_all_media(media_dict):
    word_lists_and_matrices = []
    for media_id in media_dict:
        data = get_story_word_matrix(media_id)
#         print ('Word list for media source: {}\n\n'.format(media_dict[media_id]))
#         print (data[0])
#         print ('Word matrix of word frequency counts by stories_ids for {}\n\n'.format(media_dict[media_id]))
#         print (data[1])
        word_lists_and_matrices.append(data)
    return word_lists_and_matrices

In [7]:
left_media_data = get_story_word_matrices_for_all_media(left_media_dict)
left_fake_media_data = get_story_word_matrices_for_all_media(left_fake_media_dict)
right_media_data = get_story_word_matrices_for_all_media(right_media_dict)
right_fake_media_data = get_story_word_matrices_for_all_media(right_fake_media_dict)

In [21]:
def get_repeat_words_within_media(data, threshold=20):
    """
    Args: data: word_list_and_matrix type from word_matrix API call's return
          threshold: int to determine how many times a word must be repeated in an article for it to classify as 
          a "repeat word"
    
    Returns a dictionary with stories_id as a key mapped to a list of repeated words in that article as determined by
    the threshold argument variable.
    """
    repeat_words = {}
    stories_ids = data[1].keys()
    for stories_id in stories_ids:
        word_ids = data[1][stories_id].keys()
        for word_id in word_ids:
            if data[1][stories_id][word_id] >= threshold:
                interesting_word = data[0][int(word_id)]
                if stories_id in repeat_words:
                    repeat_words[stories_id].append(interesting_word)
                repeat_words[stories_id] = [interesting_word]
                
    return repeat_words
        

Getting the articles with repeat_words in ifyouonlynews.com, New York Times, Russia Today, and FOX News, respectively.

In [22]:
left_fake_stories_with_repeat_words = get_repeat_words_within_media(left_fake_media_data[0])
left_stories_with_repeat_words = get_repeat_words_within_media(left_media_data[0])
right_fake_stories_with_repeat_words = get_repeat_words_within_media(right_fake_media_data[0])
right_stories_with_repeat_words= get_repeat_words_within_media(right_media_data[0])

print ("Fake left news stories with repeat words\n\n{}\n\n".format(left_fake_stories_with_repeat_words))
print ("Left news stories with repeat words\n\n{}\n\n".format(left_stories_with_repeat_words))
print ("Fake right news stories with repeat words\n\n{}\n\n".format(right_fake_stories_with_repeat_words))
print ("Right news stories with repeat words\n\n{}".format(right_stories_with_repeat_words))


Fake left news stories with repeat words

{'913548577': [['nixon', 'nixon']], '763478723': [['weinstein', 'weinstein']], '445589337': [['connor', 'connor']], '222771958': [['educ', 'education']], '27240834': [['balenciaga', 'balenciaga']], '1071228560': [['peni', 'penis']], '950849964': [['stassa', 'stassa']], '158300060': [['joe', 'joe']], '998761531': [['moe', 'moe']], '618597446': [['terranc', 'terrance']], '123219170': [['comic', 'comics']], '522180894': [['diaper', 'diapers']], '581587251': [['russian', 'russian']], '426458796': [['gawker', 'gawker']], '349827897': [['solo', 'solo']], '136948597': [['masterson', 'masterson']], '29013090': [['christma', 'christmas']], '297234677': [['id', 'id']], '480125819': [['herb', 'herb']], '825006726': [['twitter', 'twitter']], '210582700': [['rust', 'rust']], '435982987': [['kasich', 'kasich']], '156275859': [['io', 'ios']], '1069609497': [['commiss', 'commission']], '334372012': [['jonah', 'jonah']], '21353243': [['ramona', 'ramona']], '604

Repeat words for individual articles can help us find interesting content (and connections) to other articles for identifying smaller cliques and trends. But what if we want to find the most popular words per media source? We define a function to do this below.

In [23]:
def get_popular_words(media_id, sample_size, word_num, num_times): 
    """
    Args: media_id: media_id of a media_source
          sample_size: number of sentences to sample
          num_times: number of times to run the function and return average counts for the top 'word_num' words
          word_num: number of top words to retain for averaging in each run
          
    Returns: Most popular words used by the entire media source on average
    """
    popular_words = collections.Counter()
    word_count_call = "https://api.mediacloud.org/api/v2/wc/list"
    params = {'q': 'media_id: 1', 'sample_size': 3500, 'key': MY_API_KEY}
    
    for _ in range(num_times):
        r = requests.get(word_count_call, params = params, headers = { 'Accept': 'application/json'})
        data = r.json()
    
        for i in range(word_num): 
            popular_words[data[i]['stem']] += data[i]['count']
    
    for word in popular_words:
        popular_words[word] /= num_times
        
    
    return popular_words 


Let's take a look at some of our most popular words for each media source!

In [24]:
for media_id in left_fake_media_dict:
    popular_words = get_popular_words(media_id, 3500, 50, 3)
    print ("Popular words for {}\n\n{}\n\n".format(left_fake_media_dict[media_id], popular_words))

for media_id in left_media_dict:
    popular_words = get_popular_words(media_id, 3500, 50, 3)
    print ("Popular words for {}\n\n{}\n\n".format(left_media_dict[media_id], popular_words))
    
for media_id in right_fake_media_dict:
    popular_words = get_popular_words(media_id, 3500, 50, 3)
    print ("Popular words for {}\n\n{}\n\n".format(right_fake_media_dict[media_id], popular_words))
    
for media_id in right_media_dict:
    popular_words = get_popular_words(media_id, 3500, 50, 3)
    print ("Popular words for {}\n\n{}\n\n".format(right_media_dict[media_id], popular_words))

Popular words for Jezebel

Counter({'american': 74.66666666666667, 'unit': 57.666666666666664, 'republican': 41.666666666666664, 'obama': 40.0, 'democrat': 35.666666666666664, 'campaign': 32.666666666666664, 'trump': 32.333333333333336, 'washington': 29.666666666666668, 'univers': 29.0, 'photo': 28.0, 'senat': 27.666666666666668, 'clinton': 25.666666666666668, 'children': 25.333333333333332, 'china': 24.666666666666668, 'leagu': 24.0, 'econom': 23.666666666666668, 'score': 23.666666666666668, 'john': 20.333333333333332, 'sport': 20.0, 'america': 19.666666666666668, 'educ': 19.0, 'victori': 19.0, 'tax': 18.666666666666668, 'colleg': 18.333333333333332, 'william': 17.333333333333332, 'congress': 17.0, 'inning': 15.666666666666666, 'leader': 15.333333333333334, 'global': 15.0, 'professor': 15.0, 'media': 13.666666666666666, 'nuclear': 13.0, 'abort': 12.333333333333334, 'nation': 12.0, 'michael': 12.0, 'texa': 12.0, 'elect': 11.666666666666666, 'student': 11.666666666666666, 'san': 11.6666

Seems like a lot of the most popular words in most media sources are pretty similar. 

The interesting thing is that the order is switched with somewhat of an identifiable trend. Major(large) media outlets that are moderately left or right reference the opposing political party's last presidential candidate more than they do their own. FOX News, Forbes, and Daily Telegraph reference 'Obama' more than 'Trump' while the opposite holds true for NY Times, Politico, NPR, ACLU, and CNN. 

Extreme media sources tend to report on their own candidate with higher frequency. For example, Jezebel and ifyouonlynews.com still reference Obama more than Trump while the right-wing fake news sources do the opposite. What needs to be taken in to account here is the context of these popular and repeated words. High frequency of usage does not mean the same thing in every media source.

Another factor that has to be taken in to account is that data has been scraped longer to include Obama since he has already served 2 full terms as a President and may be more frequent considering he has been relevant in the public eye for longer. However, some might argue that Trump has been far more newsworthy in his relatively brief amount of time.

To have a better look at context for each media source, let's take some sentences these popular words are a part of and examine their surrounding words. For brevity, we'll just use 'Trump' and 'Obama' for now.

In [35]:
def get_word_count_within_media(media_id, word, num_words=50, sample_size=1000): 
    """
    Args: media_id: ID of media_source to grab word count of interesting words for. 
    """
    params = {'num_words': num_words, 'sample_size': sample_size, 'include_stats': 1, 
              'q': 'media_id: {} AND {}'.format(media_id, word), 'key': MY_API_KEY}
    word_count_call = 'https://api.mediacloud.org/api/v2/wc/list'
    r = requests.get(word_count_call, params = params, headers = { 'Accept': 'application/json'} )
    data = r.json()
    return data

In [36]:
random_interesting_words = ['Trump', 'Obama', 'fake news', 'terrorist', 'Kavanaugh', 'midterm', 'Republican', 'Democrat', 'election',
                     'Russia', 'Jeff Sessions', 'Attorney General', 'tolerance', 'racism', 'sexism', 'gender', 'snowflake',
                     'shooting', 'massacre', 'guns', 'abortion', 'radical', 'leftwing', 'rightwing', 'queer', 'gay', 'religion',
                     'healthcare', 'universal', 'immigrant', 'refugee', 'Syria', 'education', 'Beto Rourke']

In [39]:
for media_id in left_fake_media_dict:
    for word in random_interesting_words[:2]:
        print ("\n\nWords in sentences from {} that contain the popular word:{}\n\n".format(left_fake_media_dict[media_id], word))
        data = get_word_count_within_media(media_id, word)
        print (data)

for media_id in left_media_dict:
    for word in random_interesting_words[:2]:
        print ("\n\nWords in sentences from {} that contain the popular word:{}\n\n".format(left_media_dict[media_id], word))
        data = get_word_count_within_media(media_id, word)
        print (data)  
        
for media_id in right_fake_media_dict:
    for word in random_interesting_words[:2]:
        print ("\n\nWords in sentences from {} that contain the popular word:{}\n\n".format(right_fake_media_dict[media_id], word))
        data = get_word_count_within_media(media_id, word)
        print (data)
        
for media_id in right_media_dict:
    for word in random_interesting_words[:2]:
        print ("\n\nWords in sentences from {} that contain the popular word:{}\n\n".format(right_media_dict[media_id], word))
        data = get_word_count_within_media(media_id, word)
        print (data)



Words in sentences from Jezebel that contain the popular word:Trump


{'words': [{'stem': 'trump', 'count': 1116, 'term': 'trump'}, {'stem': 'donald', 'count': 304, 'term': 'donald'}, {'stem': 'campaign', 'count': 67, 'term': 'campaign'}, {'stem': 'republican', 'count': 43, 'term': 'republican'}, {'stem': 'elect', 'count': 41, 'term': 'election'}, {'stem': 'clinton', 'count': 41, 'term': 'clinton'}, {'stem': 'ivanka', 'count': 37, 'term': 'ivanka'}, {'stem': 'alleg', 'count': 32, 'term': 'allegations'}, {'stem': 'america', 'count': 28, 'term': 'america'}, {'stem': 'melania', 'count': 27, 'term': 'melania'}, {'stem': 'hillari', 'count': 25, 'term': 'hillary'}, {'stem': 'report', 'count': 23, 'term': 'reportedly'}, {'stem': 'american', 'count': 23, 'term': 'american'}, {'stem': 'unit', 'count': 22, 'term': 'united'}, {'stem': 'tweet', 'count': 22, 'term': 'tweet'}, {'stem': 'presid', 'count': 22, 'term': 'presidency'}, {'stem': 'washington', 'count': 21, 'term': 'washington'}, {'stem':

Notes from a quick glance: 
- The word 'media' is mentioned a lot more alongside 'Trump' in right wing and extreme right wing, fake news sources.
- Will find more later - didn't have time to look with more detail

**Important Note**

Before we move on to sentiment analysis and try finding interesting links for our words, we should play around more with the functions above to give us a sense of direction. The setup so far just has a very limited set of media sources and randomly defined interesting words. Even with the provided random words, I haven't run these functions on them(but plan to soon). 

Furthermore, the functions need not run fully sequentially. By that, I mean that certain information can be re-fed in to other functions. I plan on gathering more of the popular words per media source and feeding them in to the word matrix function. I'd be able to the dictionary of article mappings to repeat words that are also popular among the media sources they're a part of. This would allow me - in addition to retrieving the surrounding words in sentences above and conducting sentiment analysis below - have a better context for understanding how media sources with different biases use different words!

## Sentiment Analysis and Interesting Links (Coming by 11/23 - along with nicer visualizations)

In [None]:
def analyze_sentiment():
#     Analyze sentiment of word counts for the "interesting sentences" above
    sid = SentimentIntensityAnalyzer()
    for 
