<br><br><font color="gray">DOING COMPUTATIONAL SOCIAL SCIENCE<br>MODULE 3 <strong>PROBLEM SETS</strong></font>

# <font color="#49699E" size=40>MODULE 3 </font>


# What You Need to Know Before Getting Started

- **Every notebook assignment has an accompanying quiz**. Your work in each notebook assignment will serve as the basis for your quiz answers.
- **You can consult any resources you want when completing these exercises and problems**. Just as in the "real world", if you can't figure out how to do something, look it up. I recommend checking the relevant parts of the assigned reading or searching for inspiration on [https://stackoverflow.com](https://stackoverflow.com).
- **Each problem is worth 1 point**. All problems are equally weighted.
- **The information you need for each problem set is provided in the blue and green cells.** General instructions / the problem set preamble are in the blue cells, and instructions for specific problems are in the green cells. **You have to execute all of the code in the problem set, but you are only responsible for entering code into the code cells that immediately follow a green cell**. You will also recognize those cells because they will be incomplete. You need to replace each blank `▰▰#▰▰` with the code that will make the cell execute properly (where # is a sequentially-increasing integer, one for each blank).
- Most modules will contain at least one question that requires you to load data from disk; **it is up to you to locate the data, place it in an appropriate directory on your local machine, and replace any instances of the `PATH_TO_DATA` variable with a path to the directory containing the relevant data**.
- **The comments in the problem cells contain clues indicating what the following line of code is supposed to do.** Use these comments as a guide when filling in the blanks. 
- **You can ask for help**. 

Finally, remember that you do not need to "master" this content before moving on to other course materials, as what is introduced here is reinforced throughout the rest of the course. You will have plenty of time to practice and cement your new knowledge and skills.
<div class='alert alert-block alert-danger'>As you complete this assignment, you may encounter variables that can be assigned a wide variety of different names. Rather than forcing you to employ a particular convention, we leave the naming of these variables up to you. During the quiz, submit an answer of 'USER_DEFINED' (without the quotation marks) to fill in any blank to which you assigned an arbitrary name. In most circumstances, this will occur due to the presence of a local iterator in a for-loop.</div>

## Package Imports

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from IPython.display import display, HTML
import numpy as np
from time import sleep
from pprint import pprint

In [33]:
import warnings
warnings.filterwarnings("ignore")

## Problem 1:

<div class='alert alert-block alert-info'>
Wading into endlessly nested JSON structures and sorting through text that's full of HTML tags can be a bit daunting. Thankfully, the well-structured nature of API responses means that more straightforward numeric data is usually easy to access. Before starting, make sure your `cred.py` is properly set up like the example in the chapter.
</div>
<div class='alert alert-block alert-success'>
In this exercise, you will search the Guardian API for the term "morrison" between the dates of 2019-05-18 and 2019-05-19 - the day Australian Prime Minister Scott Morrison won an election and the day after. Using a loop, search for the term in each of the three production offices ('aus', 'uk', and 'us') and store each result in a list. 
</div>

In [2]:
import cred

# API key provided by the Guardian
GUARDIAN_KEY = cred.GUARDIAN_KEY

# Initialize Constants:
API_ENDPOINT = 'https://content.guardianapis.com/search'
office_list = ['aus','uk','us']

# Set up the request parameters, including authorization
PARAMS = {'api-key': GUARDIAN_KEY,                
             'from-date': '2019-05-18',
             'to-date': '2019-05-19',
             'q': 'morrison'}

# Initialize list for storing results from iteration
morrison_list = []

# Iterate over each of the offices in 'office_list'
for office in office_list:
    
    PARAMS['production-office'] = office      # set the search filter on each pass of the loop
    response = requests.get(API_ENDPOINT, params=PARAMS)    # send the query to the Guardian API
    response_dict = response.json()['response']        # keep the relevant component of the response
    
    total_articles = response_dict['total']        # access the metafield that indicates the total number of responses
    morrison_list.append(total_articles)           # add the resulting data to the list of results
    
    print(office + ': ' + str(total_articles))        # print results

aus: 23
uk: 1
us: 0
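
The loop above changes the shared `PARAMS` dictionary on every pass, which is why later problems have to reset some of its entries. A minimal sketch of an alternative builds a fresh parameter dictionary for each request. It assumes a hypothetical `fake_search` function standing in for `requests.get(...).json()` so that it runs offline, with canned totals copied from the output above:

```python
def fake_search(params):
    """Hypothetical stand-in for requests.get(...).json(); returns canned totals."""
    canned_totals = {'aus': 23, 'uk': 1, 'us': 0}
    return {'response': {'total': canned_totals[params['production-office']]}}

# Parameters shared by every request (the API key is omitted in this offline sketch)
BASE_PARAMS = {'q': 'morrison', 'from-date': '2019-05-18', 'to-date': '2019-05-19'}

totals = {}
for office in ['aus', 'uk', 'us']:
    # Build a new dict on each pass so BASE_PARAMS is never mutated
    params = {**BASE_PARAMS, 'production-office': office}
    payload = fake_search(params)
    totals[office] = payload['response']['total']

print(totals)
```

In the notebook itself, the call to `fake_search(params)` would be replaced by the real `requests.get(API_ENDPOINT, params=params).json()` call.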


## Problem 2:

<div class='alert alert-block alert-info'>
Next we'll grab another search term - this time 'trump' - for the same time period and the same set of three production offices. 
</div>
<div class='alert alert-block alert-success'>
Query the Guardian API for 'trump' and store the results in a list much like last time. Create a pandas dataframe out of all three lists, giving each column a reasonably descriptive name. 
</div>

In [7]:
# Initialize list for storing results from iteration
trump_list = []

#Modify the search term
PARAMS['q'] = 'trump'                  

# Iterate over each of the offices in 'office_list'
for office in office_list:
    
    PARAMS['production-office'] = office # set the search filter on each pass of the loop
    
    response = requests.get(API_ENDPOINT, params=PARAMS)  # send the query to the Guardian API
    response_dict = response.json()['response'] # keep the relevant component of the response
    
    total_articles = response_dict['total'] # access the metafield that indicates the total number of responses
    trump_list.append(total_articles)       # add the resulting data to the list of results

# Initialize an empty dataframe and create columns from the lists
df = pd.DataFrame()
df['office'], df['morrison'], df['trump'] = office_list, morrison_list, trump_list       

# See the whole (small) dataframe
df.head()

|   | office | morrison | trump |
|---|--------|----------|-------|
| 0 | aus    | 23       | 6     |
| 1 | uk     | 1        | 20    |
| 2 | us     | 0        | 18    |
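
An equivalent way to assemble the same table is to pass a dictionary of lists to `pd.DataFrame`, so that column names and data are declared together. This is only a sketch: the counts are hard-coded from the output above, and the name `counts_df` is illustrative.

```python
import pandas as pd

# Counts copied from the results shown above
office_list = ['aus', 'uk', 'us']
morrison_list = [23, 1, 0]
trump_list = [6, 20, 18]

# Each dictionary key becomes a column name; each list becomes that column's values
counts_df = pd.DataFrame({
    'office': office_list,
    'morrison': morrison_list,
    'trump': trump_list,
})

print(counts_df)
```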


## Problem 3:

<div class='alert alert-block alert-info'>
Let's take a look at what kind of stories from the UK office were mentioning Trump in that time period.
</div>
<div class='alert alert-block alert-success'>
Reconfigure the appropriate `PARAMS` dictionary entries to carry out the search, adding 'headline' to the request. Retrieve the headlines from each article returned in the response, store them in a list, and take a look at the topics suggested by the headlines! 
</div>

In [13]:
# Initialize list for storing results from iteration
trump_uk_headlines = []

# Change some of the request parameters again
PARAMS['q'] = 'trump'                          
PARAMS['production-office'] = 'uk'
PARAMS['show-fields'] = 'headline'


response = requests.get(API_ENDPOINT, params=PARAMS)  # send the query to the Guardian API
response_dict = response.json()['response']           # keep the relevant component of the response

# Iterate over each of the responses
for resp in response_dict['results']:
    headline = resp['fields']['headline']        # process the new result field
    trump_uk_headlines.append(headline)          # add the resulting data to the list of results
    
# View the list you just finished making
pprint(trump_uk_headlines) 

["BP pushed for Arctic drilling rights after Trump's election",
 "'Tariff': what's in the word behind Trump's tit-for-tat trade war?",
 'Jeremy Kyle Show: why take so long to end this daily humiliation?',
 'Old grudges, new weapons… is the US on the brink of war with Iran?',
 'Too much has been sacrificed to allow Brexit to destroy Europe’s unity',
 'Footballer Héctor Bellerín calls on sport to oppose Alabama abortion ban',
 'What my queer journey taught me about love',
 "Julianna Margulies on her shocking Ebola drama: 'I panicked in my hazmat "
 "suit!'",
 "Theresa May prepares 'bold' last-ditch offer to MPs on Brexit bill",
 'Don’t lead us to disaster, moderate Tories warn frontrunner Boris Johnson']
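
The list above holds 10 headlines even though Problem 2 counted 20 UK articles mentioning Trump, which suggests that the search results are paginated. The sketch below shows one way to walk through every page. It runs against a hypothetical `fake_search` stub rather than the live API, and the `page` request parameter and `pages` response field are assumptions that should be checked against the Guardian API documentation before use.

```python
def fake_search(params):
    """Hypothetical paged endpoint: 7 made-up results, 3 per page."""
    headlines = [f'headline {n}' for n in range(1, 8)]
    page_size = 3
    page = params.get('page', 1)
    start = (page - 1) * page_size
    total_pages = -(-len(headlines) // page_size)  # ceiling division
    return {'response': {
        'pages': total_pages,
        'results': [{'fields': {'headline': h}}
                    for h in headlines[start:start + page_size]],
    }}

all_headlines = []
page = 1
while True:
    payload = fake_search({'q': 'trump', 'page': page})['response']
    # Collect the headline from every result on this page
    all_headlines.extend(item['fields']['headline'] for item in payload['results'])
    if page >= payload['pages']:  # stop once the last page has been read
        break
    page += 1

print(len(all_headlines))
```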


## Problem 4:

<div class='alert alert-block alert-info'>
Word counts are often a byproduct of how APIs process their content, so they are frequently available. Let's see if there was much difference in the average word count of articles mentioning Trump, across all production offices, compared to those mentioning Morrison.
</div>
<div class='alert alert-block alert-success'>
Modify the relevant entries in `PARAMS`, retrieving the wordcount for each article in the response text. Results should be stored in a list for each politician to calculate and print the average word counts.
</div>

In [14]:
# Initialize lists for storing results from iteration
morrison_counts = []
trump_counts = []

# Need to request a new field and clear the office filter
PARAMS['q'] = 'morrison'
PARAMS['production-office'] = None
PARAMS['show-fields'] = 'wordcount'             

response = requests.get(API_ENDPOINT, params=PARAMS)   # send the query to the Guardian API
response_dict = response.json()['response']            # keep the relevant component of the response 

# Fill the list of word counts for the first search
for resp in response_dict['results']:
    wordcount = resp['fields']['wordcount']       # retrieve the word count
    morrison_counts.append(int(wordcount))        # int() is needed because the API returns strings

PARAMS['q'] = 'trump'
response = requests.get(API_ENDPOINT, params=PARAMS)   # send the query to the Guardian API
response_dict = response.json()['response']            # keep the relevant component of the response

# Fill the list of word counts for the second search
for resp in response_dict['results']:
    wordcount = resp['fields']['wordcount']
    trump_counts.append(int(wordcount))

# Calculate the average of the word counts
morrison_avg = np.mean(morrison_counts)        
print(morrison_avg)
trump_avg = np.mean(trump_counts)
print(trump_avg)

629.0
804.9
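
Because the word counts arrive as strings, and because a result could in principle lack the requested field, a slightly more defensive version of the averaging step can be sketched as follows. The `results` payload here is made-up illustrative data, not real API output.

```python
from statistics import mean

# Made-up results shaped like the 'results' list used above
results = [
    {'fields': {'wordcount': '700'}},
    {'fields': {'wordcount': '500'}},
    {'fields': {}},  # a result with no wordcount field
]

# Keep only results that carry a wordcount, converting each string to an int
counts = [
    int(item['fields']['wordcount'])
    for item in results
    if 'wordcount' in item.get('fields', {})
]

average_count = mean(counts)
print(average_count)
```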


## Problem 5:

<div class='alert alert-block alert-info'>
For this exercise, we're going to use the power of scraping to delve into the results of the 2019 Canadian Federal Election! Just as every journey begins with a single step, we're going to start out with some basics. 
</div>
<div class='alert alert-block alert-success'>
Use the URL for Wikipedia's riding-by-riding election results to retrieve the web page and then check to see if the result from the server was ok.
</div>

In [27]:
# Store our website address in the 'url' variable
url = "https://en.wikipedia.org/wiki/Results_of_the_2019_Canadian_federal_election"

# Retrieve the website
r = requests.get(url)

# Query as to whether or not our request was 'ok' and display result
print(r.ok)

True
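
`r.ok` is `True` whenever the status code is below 400. The sketch below illustrates this without touching the network by building `requests.Response` objects by hand, which is done purely for demonstration; in real code the response comes from `requests.get`. The helper name `make_response` is hypothetical.

```python
import requests

def make_response(status_code):
    """Build a bare Response with the given status code (illustration only)."""
    resp = requests.Response()
    resp.status_code = status_code
    return resp

ok_response = make_response(200)
missing_response = make_response(404)

print(ok_response.ok)
print(missing_response.ok)

# raise_for_status() turns a 4xx/5xx status into an exception,
# which is handy when a failed request should stop the script
try:
    missing_response.raise_for_status()
except requests.HTTPError as err:
    print('Request failed:', err)
```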


## Problem 6:

<div class='alert alert-block alert-info'>
Now that we have the HTML in hand, let's use BeautifulSoup to process the website and get its on-screen title (the text that's immediately above the 'From Wikipedia, the free encyclopedia'). Since the on-screen title differs somewhat from the tab title that we retrieved in the chapter on scraping, you might need to do a little digging in the site's HTML to figure out where it's stored. (Hint: you can find it in the 'body' of the article, and it is a type of heading.)
</div>
<div class='alert alert-block alert-success'>
Process the website using BeautifulSoup and find the on-screen title of the website we retrieved. 
</div>

In [22]:
# Use BeautifulSoup to create an HTML DOM
soup = BeautifulSoup(r.content, 'lxml')

# Use the soup object to find the text of the web page's on-screen title
on_screen_title = soup.findAll("h1")[0].text

# Display Result
print(on_screen_title)

Results of the 2019 Canadian federal election
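
To make the difference between the tab title and the on-screen title concrete, here is a small sketch on a made-up HTML snippet (not the real Wikipedia page). It uses Python's built-in `html.parser` instead of `lxml` so that it has no extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up page: the <title> in the head differs from the <h1> in the body
html = """
<html>
  <head><title>Example page - Example Site</title></head>
  <body><h1>Example page</h1><p>Some text.</p></body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

tab_title = soup.title.text              # text shown in the browser tab
on_screen_title = soup.find('h1').text   # first top-level heading in the body

print(tab_title)
print(on_screen_title)
```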


## Problem 7:

<div class='alert alert-block alert-info'>
If you scroll around on the webpage we're scraping information from, you might notice that the centrepiece of the page is a large table entitled 'Results by riding - 2019 Canadian federal election'. It contains a whole lot of data; it would be great to have access to all of it in an organized fashion! Fortunately, we can easily find tables in this article by searching for objects with the 'table' tag. Unfortunately, there are 27 such tables in the article, and we can't be certain of the order they appear in BeautifulSoup's 'findAll' results. There are many ways to programmatically ensure that you've got the correct table. In this question, we're going to take advantage of the fact that we know the name of the table we're looking for. 
</div>
<div class='alert alert-block alert-success'>
Iterate through the HTML tables in the web page to locate the "Results by riding" table.
</div>

In [28]:

# Get a list of all the tables in the web page
list_of_tables = soup.findAll("table")

# Initialize the variable we'll use to store the index
result_table_index = None

# Iterate over each table in the wikipedia article
for ii, table in enumerate(list_of_tables):
    first_row = table.findAll('tr')[0].text # Get the first row of the table
    if "Results by riding - 2019 Canadian federal election" in first_row:
        result_table_index = ii  # If we get a match, we've found our table!

# Display index of the result table
print(result_table_index)

4
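
The same search can be wrapped in a small function that returns `None` when no table matches. The sketch below runs on a made-up three-table snippet; the function name `find_table_index` is hypothetical. Unlike the loop above, which keeps the last match, it stops at the first table whose first row contains the text.

```python
from bs4 import BeautifulSoup

# Made-up document with three small tables
html = """
<table><tr><th>Summary</th></tr><tr><td>a</td></tr></table>
<table><tr><th>Results by riding</th></tr><tr><td>b</td></tr></table>
<table><tr><th>Notes</th></tr><tr><td>c</td></tr></table>
"""

tables = BeautifulSoup(html, 'html.parser').find_all('table')

def find_table_index(tables, heading_text):
    """Return the index of the first table whose first row contains heading_text."""
    for index, table in enumerate(tables):
        first_row = table.find('tr')
        if first_row is not None and heading_text in first_row.text:
            return index
    return None  # no table matched

print(find_table_index(tables, 'Results by riding'))
print(find_table_index(tables, 'No such table'))
```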


## Problem 8:

<div class='alert alert-block alert-info'>
Now that we have the index of the table we want, we can easily retrieve it from our `list_of_tables`. All of the juicy, riveting details of the election are almost within reach, but before we can grasp them, we're going to need to know a bit more about how the data is organized. Since this is an HTML table, you can think of it as being built from a large number of 'rows'. Take some time to play around with the table in your browser using development tools. We're interested in figuring out what HTML tag is used to denote a single row in the table (take, for example, Calgary Nose Hill - what's the tag that is used to identify its entire row in the table? Make sure you're inside the table's 'tbody' tag, and not its 'thead' tag).
</div>
<div class='alert alert-block alert-success'>
Find the HTML tag used to designate a single row of data in this table.
</div>

In [29]:
# Store the two-letter HTML tag you found as part of your investigation (in string format)
row_tag = "tr"

## Problem 9:

<div class='alert alert-block alert-info'>
Now that we have access to the rows of the table, we can look through them to find all of the data corresponding to any given riding! We might as well keep things close to 'home'; in the following code cell, we're going to produce a list of all the rows in the table's body and then locate the row corresponding to the 'Waterloo' riding. 
</div>
<div class='alert alert-block alert-success'>
Create a list of all of the rows in the 'Results by riding' table, and then locate the 'Waterloo' riding's row. 
</div>

In [30]:
# Retrieve the "Results by riding" table from the list of tables
results_table = list_of_tables[result_table_index]

# Create a list of all rows in the table by searching for the row tag you found
all_rows = results_table.findAll(row_tag)

# Iterate through list of rows to find 'Waterloo' row.
for i, row in enumerate(all_rows):
    if "Waterloo" in row.text:
        waterloo_row = row
        
# Process the results into a human-readable form:
waterloo_text = [r for r in waterloo_row.text.split("\n") if r]

print(waterloo_text)

['Waterloo', 'ON', 'Lib', 'Lib', '31,085', '48.8%', '15,470', '24.3%', '74.8%', '31,085', '15,615', '9,710', '–', '6,184', '1,112', '–', '–', '63,706']
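
Splitting `row.text` on newlines and discarding empty strings also discards any cell that happens to be empty, which shifts the remaining values into the wrong positions. One alternative, sketched here on a made-up four-cell row, is to read the cells one by one so that empty cells keep their place.

```python
from bs4 import BeautifulSoup

# Made-up row with an empty third cell
html = "<table><tr><td>Waterloo</td><td>ON</td><td></td><td>31,085</td></tr></table>"
row = BeautifulSoup(html, 'html.parser').find('tr')

# One list entry per header or data cell, with surrounding whitespace stripped
cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]

print(cells)
```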


## Problem 10:

<div class='alert alert-block alert-info'>
If everything went well, we should now have access to the 'waterloo_text' variable, which is a list of 18 strings we created from the Riding of Waterloo's row in the 'Results by riding' table. Fortunately for us, all of the other data rows should follow the exact same pattern. We can use this inter-row regularity to create a pandas dataframe that should closely match the table in the Wikipedia article. 
</div>
<div class='alert alert-block alert-success'>
Remove the first 5 rows of the 'all_rows' variable (index positions 0 to 4). Then, populate the pandas dataframe (we've filled in the column names already) with the rows from the table you scraped. Finally, find the number of ridings from each province and territory.
</div>

In [44]:

# Remove first 5 items from the 'all_rows' list
all_but_five_rows = all_rows[5:]

# Initialize a list of the columns from the Wikipedia table
riding_cols = [
    'riding',
    'province_or_territory',
    '2015_winning_party',
    '2019_winning_party',
    'votes',
    'share',
    'margin_num',
    'margin_pct',
    'turnout',
    'liberal',
    'conservative',
    'ndp',
    'bloc',
    'green',
    'ppc',
    'independent',
    'other',
    'total',
    'riding_url',
]

# Initialize dataframe
riding_df = pd.DataFrame(columns=riding_cols)

# Populate dataframe with rows
for row in all_but_five_rows:
    row_text = [r for r in row.text.replace(',','').split("\n") if r]
    row_text.append(row.find('a', href=True)['href'])

    while len(row_text) < 19:
        row_text.insert(-2, 0) # This fixes 3 broken rows

    df_row = pd.Series(row_text, index=riding_df.columns)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    riding_df = pd.concat([riding_df, df_row.to_frame().T], ignore_index=True)
    

# Count the number of ridings from each province and territory
riding_counts = riding_df['province_or_territory'].value_counts()

riding_counts


## Problem 11:

<div class='alert alert-block alert-info'>
If you're not a student of Canadian politics, it might surprise you to learn that in the 2019 election, the incumbent Liberal Party of Canada (Lib) received fewer votes than the Conservative Party of Canada (Con), and yet the Liberals won more seats than the Conservatives and formed government. This happened because each riding in the Canadian electoral system runs according to a winner-takes-all logic, where 100% of the rewards go to the candidate who finished in first place, even if they only beat the person in second place by a single vote. This means that having a large margin of victory in a single riding is not desirable - presumably, that represents time, resources, and effort that could have been more efficiently allocated. In this question, we're going to see if we can use the data we scraped to help shed some light on why this strange state of affairs came to pass.
</div>
<div class='alert alert-block alert-success'>
Examine the twenty ridings with the highest margin of victory. Then, examine the twenty ridings with the lowest margin of victory. The margin of victory is contained in the 'margin_num' and 'margin_pct' columns.
</div>

In [None]:
# Convert the margin_num column to a numeric column
riding_df['margin_num'] = pd.to_numeric(riding_df['margin_num'])

# Sort the entire dataframe by the value of the margin number, ascending
margin_df_ascending  = riding_df.sort_values(['margin_num'], ascending=True)

# display the 20 ridings with the largest margin of victory
display(HTML('<div class="alert alert-block alert-info">Ridings with highest margin of victory</div>'))
display(margin_df_ascending.tail(20))

# display the 20 ridings with the lowest margin of victory
display(HTML('<div class="alert alert-block alert-danger">Ridings with lowest margin of victory</div>'))
display(margin_df_ascending.head(20))
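
As a small illustration of the same idea, the sketch below uses a made-up four-row dataframe and pandas' `nlargest` and `nsmallest` methods, which pick out the extremes without sorting the whole table first. The names `toy_df`, `widest`, and `closest` are illustrative.

```python
import pandas as pd

# Made-up margins stored as strings, as they would be after scraping
toy_df = pd.DataFrame({
    'riding': ['A', 'B', 'C', 'D'],
    'margin_num': ['1200', '15', '8300', '410'],
})

# Convert to numbers; anything unparseable becomes NaN instead of raising an error
toy_df['margin_num'] = pd.to_numeric(toy_df['margin_num'], errors='coerce')

widest = toy_df.nlargest(2, 'margin_num')    # two largest margins
closest = toy_df.nsmallest(2, 'margin_num')  # two smallest margins

print(widest)
print(closest)
```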

## Problem 12:

<div class='alert alert-block alert-info'>
For this final problem, we're going to harness the information we've already gathered to create a scraper that's capable of semi-autonomously traversing a (vanishingly small) proportion of Wikipedia's pages. Specifically, we're going to take advantage of the fact that each of the riding entries that we scraped contains a link to Wikipedia's article on that riding. 
</div>
<div class='alert alert-block alert-success'>
Create a scraper capable of retrieving the 'District created' year for each of the 20 ridings with the lowest margin of victory. Average all of these years and round the result to the nearest full year. Make certain that your scraper is capable of handling riding articles that do not contain a 'District created' entry.
</div>

In [42]:

# Initialize variables
link_base = "https://en.wikipedia.org"  # scraped hrefs already begin with '/'
year_created_list = []

# create list of first 20 rows:
first_twenty_rows = list(margin_df_ascending.iterrows())[0:20]

# retrieve 'District Created' for each link in list
for i, row in first_twenty_rows:
    sleep(0.5)

    r = requests.get(link_base + row['riding_url']) # Send request to Wikipedia
    soup = BeautifulSoup(r.content, 'lxml') # Process using beautiful soup
    for tr in soup.findAll('tr'): # Iterate over our 'soup' DOM (new name avoids shadowing 'row')
        if 'District created' in tr.text: # If we find a match, add the value.
            year_created_list.append(int(tr.find('td').text))
            
# Find the average year of creation and round it to the nearest full year
avg_creation_year = round(sum(year_created_list)/len(year_created_list))

print(avg_creation_year)

1975
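
Calling `int()` directly on a cell's text raises a `ValueError` if the cell holds anything beyond a bare year, such as a footnote marker or stray whitespace. A minimal sketch of a more forgiving approach, using made-up cell strings and a hypothetical helper named `extract_year`, pulls the first four-digit year out with a regular expression and skips cells that have none.

```python
import re

def extract_year(cell_text):
    """Return the first year between 1800 and 2099 found in cell_text, or None."""
    match = re.search(r'\b(1[89]\d{2}|20\d{2})\b', cell_text)
    return int(match.group(1)) if match else None

# Made-up examples of what an infobox cell might contain
samples = ['1966', '1996[1]', ' 2003\n', 'n/a']

# Keep only the cells where a year was actually found
years = [year for year in (extract_year(s) for s in samples) if year is not None]

avg_year = round(sum(years) / len(years))
print(years, avg_year)
```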
