# Mini Project 1

## Exploring the evolution of extremist conspiracy theories through a case study of www.infowars.com

In this mini project, I explore the evolution of extremist conspiracy theories by scraping the contents of the InfoWars website. Extremist conspiracy theories are defined as theories that promote both conspiratorial beliefs and extremist political ideologies. I chose to investigate InfoWars due to its status as one of the most popular and influential far-right conspiracy websites in the United States. 

My webscraper consists of three functions. 
1. wayback_scraper() Scrapes the Wayback Machine's Internet Archive to retreieve daily snapshots of InfoWars' main content page
2. infowars_scraper() Works with pages scraped by the function in (1), further scraping article headlines, links and their associated contents. For hyperlinks provided between 2016 and 2021, the content also includes the authors of the website.
3. format_infowars_archive(). For pre-2016 articles, this function takes the output of (2) and further cleans it, extracting the names of authors from the article contents and appending them to a new column
4. read_json(). Reads any resultant jason files as a pandas dataframe for use in analysis

## Full description of my archive and a critical evaluation is in the attached word document provided in the WinRAR archive

### The archive includes:
1. The code
2. A nicely cleaned weeklong archive of a week's worth of articles between 2nd and 7th November 2008
3. A word document with the full description of my project

## Scraping all the Websites you Need

Here I scrape the infowars pages I want with the wayback_scraper function and save the output in a json file

In [1]:
#Packages Needed 
import requests
import time
import random
import json
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
#web scraping function

def wayback_scraper(url, start_date, end_date, save_to_json=None):
    '''
    Returns all Wayback Machine snapshots taken of a particular
    URL over a stretch of time
    
    Input: url (str), start_date (str), end_date (str)
        (Dates are of form: 'YYYYMMDD')
    Returns: dictionary of scraped web pages
        (key: date (str), value: scraped HTML (str))
    '''
    # Get list of snapshots for specified URL between start and end date
    r = requests.get('https://web.archive.org/cdx/search/cdx',
                     params={'url': url,
                             'from': start_date,
                             'to': end_date,
                             'collapse': 'digest',
                             'output': 'json'})
    data = r.json()

    # scrape each snapshot in list of snapshots
    wayback_base_url = 'http://web.archive.org/web/'
    scraped_pages = {}
    for entry in data[1:]:
        full_url = wayback_base_url + entry[1] + '/' + url
        try:
            r = requests.get(full_url)
            scraped_pages[entry[1]] = r.text
            #time.sleep(random.uniform(1, 5))
        except:
            print("Failed to append", full_url)
    
    if save_to_json:
        with open(save_to_json, 'w') as file:
            json.dump(scraped_pages, file)
    
    return scraped_pages

# use scraper like so (can also save in its own module if have time):
# wayback_scraper('uchicago.edu', '19970101', '19981231', 'my_scraped_pages.json')

# Scrape three sets of InfoWars Webpages

1. InfoWars pre-2008
2. InforWars between 2008 - 2015
3. InfoWars between 2016-2021

The InfoWars site has undergone numerous changes. The three broad changes occur during the time periods mentioned above, and hence the scraper has to use different techniques across each period to retreieve the relevant content

I take the saved json file earlier and run it through the second function below to retrieve the links, content, article titles and article content. I save the output in a JSON file. For post-2016 articles this includes the authors as well. For pre-2016 articles a third function is needed, written below

## Going into those scraped pages to find the data I want

In [3]:
#Scrape links, article titles and article text from InfoWars.com

def infowars_scraper(data, save_to_json=None):
    '''
    Takes a loaded json file for InfoWars article and returns a dataframe consisting of article titles, 
    article hyperlinks and article content created by InfoWars. For post-2016 articles, because authors are indexed in a
    separate tag, the author names are appended as well
    Output is saved to a json file.
    
    Input:
        data (a json file): json file with scraped web pages
        save_to_json (string): name of output json file
    
    Output:
        (dataframe): a Pandas Dataframe of the processed InfoWars data
        (JSON): a json file with the pandas data frame
    '''
    with open(data, 'r') as file:
        infowars_data = json.load(file)
    
    bookmark = ["bookmark"]

    article_hyperlinks = {} #empty dictionary for hyperlinks
    article_titles = {} #empty dictionary for article titles
    article_contents = {} #empty dictionary for article contents
    article_authors = {}
    
    wayback_url = 'https://web.archive.org/web/'
    infowars_url = '/https://www.infowars.com/'
    infowars_url_alt = '/https://www.infowars.com'
    wayback_url_alt = 'https://web.archive.org'

    container = {} #temporay container dictionary for intermediate operations

    for date, page in infowars_data.items():
        
        soup = BeautifulSoup(page, 'html.parser')
        a_tags = soup.find_all('a') #find all a tags
        td_tags = soup.find_all('td', class_ = "style1") #find all td tags of class "style1"
        
        if int(date) < 20071231999999: #for links before 2008

            hyperlinks = []

            #return hyperlinks if they are infowars hyperlinks linking to other articles

            for td_tag in td_tags: #td_tags contain the articles featured on InfoWars
                links = td_tag.find_all('a')

                for link in links:
                    if link['href'].startswith('https://web.archive.org') \
                     and 'infowars.com' in link['href']:
                        hyperlinks.append(link['href'])
                    else:
                        if len(hyperlinks) < 5: #some pre-08 websites are weird 
                            hyperlinks.append(wayback_url + date + infowars_url + link['href'])
 

            #only return unique hyperlinks
            
            hyperlinks = list(set(hyperlinks))
            article_hyperlinks[date] = hyperlinks

            titles = []

            #store all article titles in titles[]
            for td_tag in td_tags:
                links = td_tag.find_all('a')
                for link in links:    
                    if link['href'].startswith('http://web.archive.org') \
                     and 'infowars.com' in link['href']:
                        titles.append(link.text)
                    else:
                        titles.append(link.text)
                        print("This may not be a title")
            
            #only return unique titles
            
            titles = list(set(titles))
            article_titles[date] = titles 
        
        elif 20071231999999 < int(date) < 20151231999999: #for links between 2008 and 2015
            
            for date, page in infowars_data.items():
                
                soup = BeautifulSoup(page, 'html.parser')
                a_tags = soup.find_all('a') #find all a tags
                hyperlinks = []
                titles = []

                #find all article links and titles, storing them in hyperlinks[] and titles[] respectively
                for a_tag in a_tags:
                    try:
                        if a_tag['rel'] == bookmark:
                             hyperlinks.append(a_tag['href'])
                    except:
                        print("This tag does not contain a hyperlink")

                    try:
                        if a_tag['rel'] == bookmark \
                        and a_tag['title'].startswith('Permanent Link to'):
                             titles.append(a_tag['title'])
                    except:
                        print("This is not an article title")



                hyperlinks = list(set(hyperlinks))
                titles = list(set(titles))

                article_hyperlinks[date] = hyperlinks
                container[date] = titles
                
                for date, titles in container.items():
    
                    #clean all titles, removing unnecessary text
                    titles_cleaned = []

                    for headline in titles:
                        headline = headline.replace('Permanent Link to ', '')
                        titles_cleaned.append(headline)

                    article_titles[date] = titles_cleaned
                    
        elif int(date) > 20151231999999:
            
            for date, page in infowars_data.items():
                
                soup = BeautifulSoup(page, 'html.parser')
                a_tags = soup.find_all('a')
                div_tags = soup.find_all('div', class_ = "article-content")

                hyperlinks = []

                for div_tag in div_tags:
                    links = div_tag.find_all('a')
                    for link in links:
                        try:
                            hyperlinks.append(link['href'])
                        except:
                            print("No hyperlink found in", link)

                hyperlinks = list(set(hyperlinks))

                article_hyperlinks[date] = hyperlinks
                
            for date, page in infowars_data.items():
                
                soup = BeautifulSoup(page, 'html.parser')
                div_tags = soup.find_all('div', class_ = "article-content")
                h1_tags = soup.find_all('h1', class_ = "css-f3lqni")
                titles = []

                for div_tag in div_tags:
                    links = div_tag.find_all('a')
                    for link in links:
                        titles.append(link.text)

                titles = list(set(titles))

                article_titles[date] = titles

    
    #Once the hyperlinks have been stored in article_hyperlinks{}, iterate through them and extract their contents 
    #store these contents in article_contents{}
    
    for date, hyperlink_list in article_hyperlinks.items():
        
        if int(date) < 20071231999999: #pre-2008 articles

            content = []

            for link in hyperlink_list:   

                r = requests.get(link)
                r_text = r.text
                soup = BeautifulSoup(r_text, 'html.parser')
                articles = soup.find_all('td')

                for article in articles:
                    try:
                        texts = article.find('p', class_ = "subheadline_body").text
                        content.append(texts)
                    except:
                        print("no content found")

            content = list(set(content))

            article_contents[date] = content
            
        elif 20071231999999 < int(date) < 20151231999999: #articles between 2008 and 2016
            
            content = []
    
            for link in hyperlink_list:   

                r = requests.get(link)
                r_text = r.text
                soup = BeautifulSoup(r_text, 'html.parser')
                articles = soup.find_all('div', class_ = 'subarticle')

                for article in articles:
                    texts = article.find('p').text
                    texts = texts.split('Random iframe content', 1)[0]#remove comments
                    content.append(texts)

            content = list(set(content))

            article_contents[date] = content
            
        elif int(date) > 20151231999999: #articles after 2016
    
            content = []
            authors = []

            for link in hyperlink_list:   

                r = requests.get(link)
                r_text = r.text
                soup = BeautifulSoup(r_text, 'html.parser')
                articles = soup.find_all('div', class_ = 'text')
                span_tags = soup.find_all('span', class_ = 'author')


                for article in articles:
                    texts = article.find('p').text
                    content.append(texts)
                
                for span_tag in span_tags:
                    link = span_tag.find('a')
                    try:
                        authors.append(link['title'])
                        authors.append(link.text)
                        authors.append(span_tag.text)
                    except:
                        print("No article author found")
                    

            content = list(set(content))
            authors = list(set(authors))

            article_contents[date] = content
            article_authors[date] = authors

    
    #combine all three dictionaries into a dataframe and store it in a csv
    
    infowars_df = pd.DataFrame({'contents':pd.Series(article_contents),'titles':pd.Series(article_titles), \
                                'links': pd.Series(article_hyperlinks), 'authors':pd.Series(article_authors)})

    infowars_json = infowars_df.to_json(orient = 'index')
    
    if save_to_json:
        with open(save_to_json, 'w') as file:
            json.dump(infowars_json, file)
    #json
    
    return infowars_df



## Cleaning up the data I have collected

### Currently only required for pre-2016 articles!

I clean up the data to have a nice pandas dataframe and save that to a final json file. This sample file is included in the archive

In [4]:
def format_infowars_archive(file, save_to_json=None):
    '''
    Takes a loaded json file of scraped and organized content from Infowars and cleans it. 
    For pre-2016 articles, the authors are retrievable by processing text from the content. This function thus
    cleans the JSON from the second function, and returns a dataframe consisting of article titles, 
    article hyperlinks, article content and authors created by InfoWars organised by date. Output is saved to a json
    file as well.
    
    Input:
        data (a json file): json file with processed InfoWars data
        save_to_json (string): name of output json file
    
    Output:
        (dataframe): a Pandas Dataframe of the cleaned InfoWars data
        (JSON): a json file with the pandas data frame
    '''
    with open(file, 'r') as file:
        data = json.load(file)
    
    df = pd.read_json(data, convert_axes=False)
    df = df.transpose()
    df.index.name = "Date-Time"
    
    new_column = []
    Date = []
    contents = df['contents']

    for articles in contents:
        authors = []
        for article in articles:
            line = article.splitlines()
            if len(line) > 1:
                if len(line[1]) < 100:
                    authors.append(line[1])
            else:
                if len(line[0]) < 100:
                    authors.append(line[0])
        new_column.append(authors)


    df["Authors"] = new_column
    
    for row in df.index:
        date = row[0:8]
        Date.append(date)
    
    df["Date"] = Date
    df["Date"] = df["Date"].astype(int)
    df['Date']= pd.to_datetime(df['Date'], format = "%Y%m%d")
    
    infowars_json = df.to_json(orient = 'index')
    
    if save_to_json:
        with open(save_to_json, 'w') as file:
            json.dump(infowars_json, file)

    return df

## Reading the file

In [5]:
def read_json(data):
    '''
    Reads a json file and carefully indexes it
    
    Input:
        data (json): A json file
    Output:
        df (dataframe): A pandas data frame    
    '''
    with open(data, 'r') as file:
        infowars_data = json.load(file)


    df = pd.read_json(infowars_data, convert_axes = False)
    df = df.transpose()
    df.index.name = "Date-Time"
    Date = []
    
    for row in df.index:
        date = row[0:8]
        Date.append(date)

    df["Date"] = Date
    df["Date"] = df["Date"].astype(int)
    df['Date']= pd.to_datetime(df['Date'], format = "%Y%m%d")
    
    return df

## Let's try it out!

In [9]:

wayback_scraper('infowars.com', '20121102', '20121203', 'obama_scraped_pages_nov12.json')
df = infowars_scraper('obama_scraped_pages_nov12.json', 'obama_scraped_pages_nov12_cleaned.json')
df = format_infowars_archive('obama_scraped_pages_nov12_cleaned.json', 'obama_scraped_pages_nov12_formatted.json')

This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article t

This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article title
This tag does not contain a hyperlink
This is not an article t



IndexError: list index out of range

In [14]:
df = read_json("obama_scraped_pages_nov09_cleaned.json")
df

Unnamed: 0_level_0,contents,titles,links,authors,Date
Date-Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
20091124014606,"[\nJohn Byrne\nRaw Story\nNovember 24, 2009\nT...",[Hot ‘Climategate’ debate: Scientists clash LI...,[http://web.archive.org/web/20091125213718/htt...,,2009-11-24
20091125213718,"[\nJohn Byrne\nRaw Story\nNovember 24, 2009\nT...",[Hot ‘Climategate’ debate: Scientists clash LI...,[http://web.archive.org/web/20091125213718/htt...,,2009-11-25
20091126054429,"[\nJohn Byrne\nRaw Story\nNovember 24, 2009\nT...",[Hot ‘Climategate’ debate: Scientists clash LI...,[http://web.archive.org/web/20091126054429/htt...,,2009-11-26
20091129195231,"[\nRussia Today\nNovember 25, 2009\n\nA respec...",[Hot ‘Climategate’ debate: Scientists clash LI...,[http://web.archive.org/web/20091129195231/htt...,,2009-11-29
20091202003429,"[\nTony Romm\nThe Hill\nDecember 1, 2009\nA d ...",[Australian government wants right to detain s...,[http://web.archive.org/web/20091202003429/htt...,,2009-12-02


In [8]:
df = read_json('obama_scraped_pages_nov1_cleaned.json')
df

Unnamed: 0_level_0,contents,titles,links,authors,Date
Date-Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
20081102183949,"[\nRIA Novosti\nNovember 2, 2008\nRussia’s For...",[US defense secretary expands pre-emptive war ...,[http://web.archive.org/web/20081102183949/htt...,,2008-11-02
20081104045042,"[\nCanadian Press\nNovember 3, 2008\nCalgary p...",[US defense secretary expands pre-emptive war ...,[http://web.archive.org/web/20081104045042/htt...,,2008-11-04
20081105095703,"[\nInfowars\nNovember 4, 2008\n\nFrom BoysStuf...",[US defense secretary expands pre-emptive war ...,[http://web.archive.org/web/20081105095703/htt...,,2008-11-05
20081105164643,"[\nInfowars\nNovember 4, 2008\n\nFrom BoysStuf...",[US defense secretary expands pre-emptive war ...,[http://web.archive.org/web/20081105164643/htt...,,2008-11-05
20081106054814,"[\nYouTube\nNovember 4, 2008\n\n\nA d v e r t ...",[IAEA chief: Iran not close to developing nucl...,[http://web.archive.org/web/20081106054814/htt...,,2008-11-06
20081108021127,"[\nSometimes, you wake up following a drunken ...","[Undercover cops were among the unruly at DNC,...",[http://web.archive.org/web/20081108021127/htt...,,2008-11-08
20081109030644,"[\nJoseph Cannon\nCannonfire\nNovember 8, 2008...","[Undercover cops were among the unruly at DNC,...",[http://web.archive.org/web/20081109030644/htt...,,2008-11-09
20081109030651,"[\nJoseph Cannon\nCannonfire\nNovember 8, 2008...","[Undercover cops were among the unruly at DNC,...",[http://web.archive.org/web/20081109030651/htt...,,2008-11-09
20081110174800,"[\nJoseph Cannon\nCannonfire\nNovember 8, 2008...",[Progressives Need to Dump Obama and the Corpo...,[http://web.archive.org/web/20081110174800/htt...,,2008-11-10
20081112071029,"[Australian IT\nNovember 11, 2008\nThe federal...",[Obama’s ‘Change’ Likely to Include Funding Ab...,[http://web.archive.org/web/20081112071029/htt...,,2008-11-12


## Let's read the sample file provided!

In [3]:
read_json('infowars_pre2016_cleaned.json')

Unnamed: 0_level_0,contents,titles,links,authors,Authors,Date
Date-Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
20081102183949,"[\nThe Betrayal\nNovember 1, 2008\n\n\nA d v e...",[Obama: Government Should “Change Behavior” by...,[http://web.archive.org/web/20081102183949/htt...,,"[The Betrayal, Alex Lantier, Press TV, AFP, G...",2008-11-02
20081104045042,"[\nAlex Lantier\nWSWS\nOctober 31, 2008\nIn a ...","[D.C. Metro to Randomly Search Riders’ Bags, D...",[http://web.archive.org/web/20081104045042/htt...,,"[Alex Lantier, Canadian Press, Press TV, AFP,...",2008-11-04
20081105095703,"[\nAlex Lantier\nWSWS\nOctober 31, 2008\nIn a ...","[D.C. Metro to Randomly Search Riders’ Bags, M...",[http://web.archive.org/web/20081105095703/htt...,,"[Alex Lantier, Canadian Press, YouTube, Press...",2008-11-05
20081105164643,"[\nAlex Lantier\nWSWS\nOctober 31, 2008\nIn a ...","[D.C. Metro to Randomly Search Riders’ Bags, M...",[http://web.archive.org/web/20081105164643/htt...,,"[Alex Lantier, Canadian Press, YouTube, Press...",2008-11-05
20081106054814,"[\nCanadian Press\nNovember 3, 2008\nCalgary p...",[National road toll devices to be tested by dr...,[http://web.archive.org/web/20081106054814/htt...,,"[Canadian Press, 9/11 Blogger , YouTube, Pres...",2008-11-06
20081108021127,"[\n9/11 Blogger \nWednesday, Nov 5, 2008\n\n\n...",[National road toll devices to be tested by dr...,[http://web.archive.org/web/20081108021127/htt...,,"[9/11 Blogger , Press TV, AFP, Dana Milbank, ...",2008-11-08
20081109030644,"[\nInfowars\nNovember 8, 2008\n\n\nA d v e r t...",[National road toll devices to be tested by dr...,[http://web.archive.org/web/20081109030644/htt...,,"[Infowars, 9/11 Blogger , Tom Eley, AFP, Pres...",2008-11-09
20081109030651,"[\nInfowars\nNovember 8, 2008\n\n\nA d v e r t...",[National road toll devices to be tested by dr...,[http://web.archive.org/web/20081109030651/htt...,,"[Infowars, 9/11 Blogger , Tom Eley, Press TV,...",2008-11-09


In [4]:
df = read_json('infowars_pre2016_cleaned.json')

In [9]:
df['contents'][0][2]

'\n Press TV\nThursday, Oct 23, 2008\nTehran has warned that any attack on the country would be ‘insanity’ because in this case the war would be extended beyond Iran’s borders.\n“Iran will not be confined to its borders in responding to any aggression against its territory,” Fars news agency quoted Brig. Gen. Mohammad-Baqer Zolqadr a senior military official, as saying on Wednesday.\nHe played down the threats made by Israel against Iran and said: “The Israeli regime is too weak to launch any attack on a great power like Iran.”\n \nZolqadr added that all Muslim nations would back Iran against its enemies if a military action were launched against it.\n“If aggressive powers wage a war against Iran, certainly they will not be able to determine when the war will end and it will be up to the Iranian nation to determine such a war’s outcome,” he noted.\n“The Iranian nation has always proved that they will make history in defending their country against the enemies,” Zolqadr concluded.\nThe 