# CSC271: Reading static structured data from the web

## Lecture Topics
- Reading static structured files from the web

## Introduction

In the lecture prep, you learned about the `requests` module and how to make requests. In this lecture, we'll explore how to read static structured files from the web programmatically.

The Elections Ontario website contains reports on voters and voter turnout:
https://results.elections.on.ca/en/publications



We are going to use the "Statistical Summary for previous General Elections" report. If we right-click on the link on the webpage, we can copy the CSV file's URL:
https://results.elections.on.ca/api/report-groups/48/report-outputs/1088/csv

Now we'll use `requests` to read that CSV data into our program:

In [1]:
import requests
import pandas as pd

url = "https://results.elections.on.ca/api/report-groups/48/report-outputs/1088/csv"

response = requests.get(url)

# Ask requests to guess which character encoding is used so the text displays correctly
response.encoding = response.apparent_encoding

if response.status_code == 200:
    print(response.text)
else:
    print(f'The request was not successful: status code {response.status_code}')

Year,NumberOfCandidates,NumberOfNamesRegistered,NumberOfNamesRevised,TotalNumberOfNames,RuralPollingPlaces,UrbanPollingPlaces,AdvancePollingPlaces,ValidBallotsFromBoxes,RejectedBallotsFromBoxes,UnmarkedBallotsFromBoxes,DeclinedBallots,VoterTurnoutTotal,VoterTurnoutPercentageOfList,FootNoteEnglish,FootNoteFrench
2025,768,10807901,375685,11183586,0,7279,800,5022142,17789,11296,5706,5056933,0.4521745529564488528,,
2022,897,10695373,45053,10740426,0,7277,795,4701959,15587,10685,4245,4732476,0.4406227462486124852,,
2018,823,9809118,436948,10246066,0,7648,771,5744860,15832,22910,22684,5806286,0.5666844230751587975,"Reduction reflects a restructuring of polling divisions from the introduction of technology, with no impact on elector allocations","La réduction s'explique par la restructuration des sections de vote qui a été opèrée dans le cadre de la mise en place des équipments technologiques, sans aucune incidence sur la répartition des électeurs"
2014,615,9248764,277267,9526031,0,24815,913,

<div class="alert alert-block alert-success">

<h3>Character encodings</h3>
  A character encoding is the convention used to represent each character numerically. Examples of encodings:
  <ul>
  <li>UTF-8: any Unicode character including English, Chinese, Arabic, emoji, etc.</li>
  <li>ISO-8859-1: mostly Western European characters, including English, French, German, etc.</li>
  <li>Windows-1251: Cyrillic characters, including Russian, Bulgarian, etc.</li>
  </ul>

  UTF-8 is by far the most common encoding on the web.

  Source: https://en.wikipedia.org/wiki/Character_encoding
</div>
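
To make the idea of an encoding concrete, here is a minimal sketch that needs no network access. It encodes the character `é` (which appears in the French footnotes of the report) with two of the encodings listed above, then shows what happens when bytes are decoded with the wrong one. The variable names are illustrative only and are not used elsewhere in this lecture.

```python
# The same character is stored as different bytes under different encodings.
text = "é"

utf8_bytes = text.encode("utf-8")         # two bytes
latin1_bytes = text.encode("iso-8859-1")  # one byte

# Decoding with the encoding that was used to encode gives back the original.
round_trip = utf8_bytes.decode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 raises no error, but the text is garbled.
garbled = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes, latin1_bytes)  # b'\xc3\xa9' b'\xe9'
print(round_trip)                # é
print(garbled)                   # Ã©
```

This is why the request code sets `response.encoding` before reading `response.text`: if the wrong encoding is assumed, accented characters can silently turn into sequences like `Ã©`.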

### StringIO 

After the request above, `response.text` holds the contents of the CSV file as a single long string. Below we print the length of the string and its first 500 characters.

In [None]:
# TODO: print the length of the string
print(len(response.text))

# TODO: print the first 500 characters of the text
#print(response.text ...)


4798


We want to be able to easily split the string into rows of data, but don't want to have to do it manually by considering the locations of the commas and newline characters. Wouldn't it be nice to be able to use `pd.read_csv`? Well, we can!
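
As a rough illustration of why splitting manually is error-prone, consider the 2018 row, whose English footnote contains a comma inside quotation marks. The sketch below uses a shortened three-field version of that row (an assumption made for brevity; the real row has many more columns) and compares a naive `split(",")` with Python's built-in `csv` module.

```python
import csv

# A shortened row modelled on the 2018 data: year, number of candidates, footnote.
# The footnote is quoted because it contains a comma.
line = ('2018,823,"Reduction reflects a restructuring of polling divisions '
        'from the introduction of technology, with no impact on elector allocations"')

# Naive approach: split on every comma, including the one inside the quotes.
naive_fields = line.split(",")

# CSV-aware approach: csv.reader accepts any iterable of lines.
parsed_fields = next(csv.reader([line]))

print(len(naive_fields))   # 4 (the footnote was cut in two)
print(len(parsed_fields))  # 3
print(parsed_fields[2])
```

A CSV-aware parser handles the quoting rules for us, which is the main reason to prefer an existing reader over hand-written string splitting.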

There is a class named `StringIO`, found in the built-in `io` module, that allows us to treat a string as though it were a file. Functions that accept a text file object (the kind of object we receive when we call `open` in text mode) will generally accept a `StringIO` object as well.
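
Here is a small sketch of that file-like behaviour, using a short made-up sample string rather than the downloaded data. The helper `count_data_rows` is a hypothetical function written for this example; it only relies on operations that an open text file also supports.

```python
from io import StringIO

def count_data_rows(f):
    """Count the non-empty lines after the header in a text file-like object."""
    next(f)  # skip the header line
    return sum(1 for line in f if line.strip())

# Sample text with a header and two data rows (values taken from the report).
sample_text = "Year,NumberOfCandidates\n2025,768\n2022,897\n"
sample_file = StringIO(sample_text)

# Read one line, just as we could with a file returned by open().
first_line = sample_file.readline()
print(first_line.strip())  # Year,NumberOfCandidates

# Move back to the start, again just like a file.
sample_file.seek(0)

n_rows = count_data_rows(sample_file)
print(n_rows)  # 2
```

Because `count_data_rows` never checks what kind of object it was given, the same function would work unchanged on a file opened with `open`.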

Once we've produced our `StringIO` object, we can call `pd.read_csv` as usual:

In [4]:
from io import StringIO

# TODO: Treat the text as though it is a file object
elections = StringIO(response.text)

# TODO: produce a DataFrame based on the text
df = pd.read_csv(elections)

df.head()

,Year,NumberOfCandidates,NumberOfNamesRegistered,NumberOfNamesRevised,TotalNumberOfNames,RuralPollingPlaces,UrbanPollingPlaces,AdvancePollingPlaces,ValidBallotsFromBoxes,RejectedBallotsFromBoxes,UnmarkedBallotsFromBoxes,DeclinedBallots,VoterTurnoutTotal,VoterTurnoutPercentageOfList,FootNoteEnglish,FootNoteFrench
0,2025,768,10807901,375685,11183586,0,7279,800,5022142,17789,11296,5706,5056933,0.452175,,
1,2022,897,10695373,45053,10740426,0,7277,795,4701959,15587,10685,4245,4732476,0.440623,,
2,2018,823,9809118,436948,10246066,0,7648,771,5744860,15832,22910,22684,5806286,0.566684,Reduction reflects a restructuring of polling ...,La réduction s'explique par la restructuration...
3,2014,615,9248764,277267,9526031,0,24815,913,4820547,22885,12124,29937,4885493,0.512857,,
4,2011,655,8761095,239280,9000375,0,24179,907,4316382,12892,5208,2335,4336817,0.481848,,


Once we've read in the data from the website, we can use it in our programs as we would any other DataFrame (after cleaning and preparation). For example, we can filter to see only the years with at least 70% voter turnout.

In [5]:
high_df = df[df['VoterTurnoutPercentageOfList'] >= 0.7]
high_df.head()

,Year,NumberOfCandidates,NumberOfNamesRegistered,NumberOfNamesRevised,TotalNumberOfNames,RuralPollingPlaces,UrbanPollingPlaces,AdvancePollingPlaces,ValidBallotsFromBoxes,RejectedBallotsFromBoxes,UnmarkedBallotsFromBoxes,DeclinedBallots,VoterTurnoutTotal,VoterTurnoutPercentageOfList,FootNoteEnglish,FootNoteFrench
15,1971,384,4503142,0,4503142,4980,13267,470,3292717,18059,0,0,3310776,0.735215,,
22,1945,317,2165793,304167,2469960,5,0,106,1765793,15921,225,1451,1783390,0.722032,,
24,1937,266,2228030,200,2228230,6,0,80,1571133,15789,183,1299,1588404,0.712855,,
25,1934,261,2330420,-200000,2130420,9968,0,102,1561826,15683,0,1584,1579093,0.741212,,
29,1919,288,1337387,41334,1378721,0,7986,0,1176541,39262,0,8640,1224443,0.888101,,
