# Web Scraping + File I/O

##### Today's Topics:
1. Urllib and Beautiful Soup
2. Selenium
3. File Input/Output

* This is likely the most important day in the course (along with Day 5 on APIs).  
* You will use all the modules here if you want to scrape the internet.

***

### 1. Web Scraping (without APIs)

* Web scraping is the art of extracting data from websites and delivering it in formats like JSON, CSV, HTML, PDF, etc.
* Web scraping can be done either by using coding languages like Python, or by using data extraction APIs (Day 5).

##### Benefits 

1. Time-saving
2. Data accuracy
3. Cost-effective 

##### Ethics 

- Use a public API when one is available, and avoid scraping altogether if the data you are looking for is available through the API
- Only scrape when it is legal! 
    - NOT all sites can be legally scraped. Please don't get sued. 
    - Always check terms of service.
    - When in doubt, ask or don't do it. 
- Be polite and don't break websites
    - Scrape your data at a reasonable rate and control the number of requests per second. 
    - You don't want the website owner to mistake your scraper for a DDoS attack. 
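One way to make the "be polite" advice concrete is to consult a site's `robots.txt` before crawling. The sketch below uses the standard library's `urllib.robotparser`. The rules, the bot name `my-course-bot`, and the `example.com` URLs are made up for illustration, and `robots.txt` is a courtesy convention, not a substitute for reading the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (a real one lives at <site>/robots.txt)
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse rules from a list of lines

# Ask whether our (made-up) bot may fetch each address
allowed = parser.can_fetch("my-course-bot", "https://example.com/people/")
blocked = parser.can_fetch("my-course-bot", "https://example.com/private/data.html")
delay = parser.crawl_delay("my-course-bot")  # seconds requested between hits

print(allowed, blocked, delay)
```

In a real run you would call `parser.set_url(...)` and `parser.read()` to download the site's own file instead of parsing a string.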

##### Overview of Web Scraping (without APIs)

1. Call the website and open it
2. Extract or load all the html code (you can store it locally for later use)
3. Retrieve information using the names of the tags, ids, etc. 
4. Store the data into files (like csv)

#### 1.1 The Skeleton HTML Layout

In [None]:
# <!DOCTYPE html> <html>
# <head>
# <title> Page Title </title>
# </head>
# <body>

# <h1>My first heading </h1>
# <p>My first paragraph. </p>

# </body> 
# </html>

_See https://www.w3schools.com/tags/default.asp for a list of HTML tags_
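To see how the tags in the skeleton nest inside each other, here is a small sketch that walks the same HTML with Python's built-in `html.parser` (not the BeautifulSoup library used in the rest of this lecture). The class name `OutlinePrinter` is just an illustrative choice.

```python
from html.parser import HTMLParser

SKELETON = """<!DOCTYPE html>
<html>
<head>
<title> Page Title </title>
</head>
<body>
<h1>My first heading </h1>
<p>My first paragraph. </p>
</body>
</html>"""

class OutlinePrinter(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1  # everything until the closing tag is one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1

printer = OutlinePrinter()
printer.feed(SKELETON)
print("\n".join(printer.outline))
```

The indentation in the printed outline mirrors the parent/child structure that we navigate later with BeautifulSoup.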

##### Let's look at some source code!

* Now go to https://polisci.wustl.edu/people/88/ 
* Right-click, then choose View Page Source or (more likely) Inspect


#### 1.2 Web Crawlers

##### We mainly use two libraries: `urllib` and `BeautifulSoup`

1. `urllib`:
    - web crawler 
    - navigates to a url
2. `BeautifulSoup`
    - parses a downloaded HTML


Useful when:
- Info is contained in HTML (not served by JavaScript)
- Encoded HTML follows predictable pattern
- Example: https://www.presidency.ucsb.edu/documents/app-categories/presidential

Beautiful Soup documentation: 
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

* Run the line below in a Jupyter cell if it is not installed already

In [12]:
# ! pip install beautifulsoup4

In [13]:
from bs4 import BeautifulSoup
import urllib.request

#### 1.3 Example (WUSTL Political Science Webpage):

1. Open a web page

In [21]:
web_address = 'https://polisci.wustl.edu/people/88/'
web_page = urllib.request.urlopen(web_address)
web_page #stored on machine
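Requests can fail (timeouts, certificate problems, missing pages), so it can help to wrap the call. Below is a minimal sketch, not the only way to do it: the function name `fetch_html`, the `User-Agent` string, and the 10-second timeout are all illustrative assumptions. The demo uses a `data:` URL so it runs without an internet connection; swap in a real address such as `web_address` when you are online.

```python
import urllib.error
import urllib.parse
import urllib.request

def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "python-course-example"}  # made-up identifier
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as err:
        print("Could not open", url, "->", err.reason)
        return None

# Offline demo: a data: URL carries its own content
demo_url = "data:text/html," + urllib.parse.quote("<h1>My first heading</h1>")
print(fetch_html(demo_url))
```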


2. Parse it

In [144]:
soup = BeautifulSoup(web_page.read())
# print(soup)
# print(soup.prettify()) # enable us to view how tags are nested in the document

In [145]:
str(soup.prettify())[0:1500]

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n  <link href="https://polisci.wustl.edu/sites/all/themes/olympian/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>\n  <meta content="Drupal 7 (http://drupal.org)" name="generator"/>\n  <link href="https://polisci.wustl.edu/people/88" rel="canonical"/>\n  <link href="https://polisci.wustl.edu/people/88" rel="shortlink"/>\n  <meta content="Department of Political Science" property="og:site_name"/>\n  <meta content="article" property="og:type"/>\n  <meta content="https://polisci.wustl.edu/people/88" property="og:url"/>\n  <meta content="Faculty" property="og:title"/>\n  <meta content="Faculty" itemprop="name"/>\n  <title>\n   Faculty | Department of Political Science\n  </title>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content=" Department of Political Science " name="author"/>\n  <link href="https://polisci.wustl.e

3. Find all cases of a certain tag 'a'

In [146]:
soup.find_all('a')[2:10] # Returns a list... remember this!

[<a href="/"> Department of Political Science </a>,
 <a class="first-level" href="/undergraduate">Undergraduate Program</a>,
 <a class="first-level" href="/graduate-program">Graduate Program</a>,
 <a class="second-level" href="/MA-Statistics">Master’s Degree in Statistics for Political Science Ph.D. Students</a>,
 <a class="first-level" href="/research">Research</a>,
 <a class="first-level" href="/people">Our People</a>,
 <a class="first-level" href="/resources">Resources</a>,
 <a class="first-level" href="https://gifts.wustl.edu/index.html?other_designation_description=Political%20Science">Make a Gift to Political Science</a>]

4. Find all cases of a certain tag `<h3>`

In [147]:
soup.find_all('h3')[2:12]

[<h3><div><span>Deniz</span> </div><div><span>Aksoy</span></div></h3>,
 <h3><div><span>Zoe</span> </div><div><span>Ang</span></div></h3>,
 <h3><div><span>Timm</span> </div><div><span>Betz</span></div></h3>,
 <h3><div><span>Zachary</span> </div><div><span>Bowersox</span></div></h3>,
 <h3><div><span>Christina L.</span> </div><div><span>Boyd</span></div></h3>,
 <h3><div><span>Daniel </span> </div><div><span>Butler</span></div></h3>,
 <h3><div><span>Taylor</span> </div><div><span>Carlson</span></div></h3>,
 <h3><div><span>David</span> </div><div><span>Carter</span></div></h3>,
 <h3><div><span>Dino P.</span> </div><div><span>Christenson</span></div></h3>,
 <h3><div><span>Brian F.</span> </div><div><span>Crisp</span></div></h3>]

5. Extract text from the tag

In [148]:
names = soup.find_all('h3') # list of html entries
[i.text for i in names][0:10] # grab just the text from each one

['Select a Person Type',
 'Areas of Interest',
 'Deniz\xa0Aksoy',
 'Zoe\xa0Ang',
 'Timm\xa0Betz',
 'Zachary\xa0Bowersox',
 'Christina L.\xa0Boyd',
 'Daniel \xa0Butler',
 'Taylor\xa0Carlson',
 'David\xa0Carter']
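* The `\xa0` in the output is a non-breaking space from the HTML. A quick sketch of one way to normalize it with plain string methods (the sample values are copied from the output above):

```python
raw_names = ['Deniz\xa0Aksoy', 'Daniel \xa0Butler', 'Christina L.\xa0Boyd']

# str.split() with no argument splits on any whitespace, including \xa0,
# so joining the pieces with a single space cleans up the names
clean_names = [" ".join(name.split()) for name in raw_names]
print(clean_names)
```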

* We can create an object containing all elements with the tag `<a>`. Then, get the attributes

In [43]:
all_a_tags = soup.find_all('a')
# all_a_tags
all_a_tags[36].attrs  # returns a dictionary with the attributes

{'href': '/people/amy-gais', 'class': ['card']}

* Access the attributes with key-value syntax

In [53]:
all_a_tags[36].attrs.keys()

dict_keys(['href', 'class'])

In [44]:
all_a_tags[36]['href']

'/people/amy-gais'

In [45]:
all_a_tags[36]['class']

['card']

In [46]:
for i in range(34,40):
  print(all_a_tags[i]['href'])

/people/justin-fox
/people/matthew-gabel
/people/amy-gais
/people/james-l-gibson
/people/matthew-hayes
/people/clarissa-rile-hayward


##### Some notes

*  Be careful with the first and last tags: they are often different from the others

In [47]:
all_a_tags[0].attrs

{'class': ['skip-link', 'screen-reader-text'], 'href': '#content'}

* Because `all_a_tags` is a list, we need to index the element(s) we're interested in
* If we are interested in the first instance of the tag `<a>`, we can use

In [48]:
soup.find('a')

<a class="skip-link screen-reader-text" href="#content">Skip to content</a>

In [49]:
soup.find('a').attrs 

{'class': ['skip-link', 'screen-reader-text'], 'href': '#content'}

We can use a loop (for or while) to get and re-organize all the data.

In [50]:
l = {"class" : [], "href" : []} # create a dictionary
for p in range(20,43):
    l["class"].append(all_a_tags[p].attrs["class"]) 
    l["href"].append(all_a_tags[p].attrs["href"]) 

print(l)

{'class': [['view'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card']], 'href': ['/people/list/88', '/people/deniz-aksoy', '/people/zoe-ang', '/people/timm-betz', '/people/zachary-bowersox', '/people/christina-l-boyd', '/people/daniel-butler', '/people/taylor-carlson', '/people/david-carter', '/people/dino-p-christenson', '/people/brian-f-crisp', '/people/alfred-darnell', '/people/ted-enamorado', '/people/lee-epstein', '/people/justin-fox', '/people/matthew-gabel', '/people/amy-gais', '/people/james-l-gibson', '/people/matthew-hayes', '/people/clarissa-rile-hayward', '/people/jaclyn-kaslovsky', '/people/frank-lovett', '/people/christopher-lucas']}
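The loop above only works because every tag in that range has both attributes; `attrs["class"]` raises a `KeyError` for a tag without a `class`. A hedged sketch of a more forgiving version uses the tag's `.get()` method, which returns `None` when the attribute is missing. The small HTML snippet below is invented for the example.

```python
from bs4 import BeautifulSoup

# Invented snippet: the second link has no class attribute
html = """
<a class="skip-link" href="#content">Skip to content</a>
<a href="/">Home</a>
<a class="card" href="/people/deniz-aksoy">Deniz Aksoy</a>
<a class="card" href="/people/zoe-ang">Zoe Ang</a>
"""
soup_demo = BeautifulSoup(html, "html.parser")

links = {"class": [], "href": []}
for tag in soup_demo.find_all("a"):      # loop over the tags, no hard-coded range
    links["class"].append(tag.get("class"))  # None when the attribute is missing
    links["href"].append(tag.get("href"))

print(links)
```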


##### If we are interested only in the attributes of `class = card` nested within tag 'a', we can specify this in our initial `find_all()` call:

In [56]:
soup.find_all('a', {'class' : "card"})[0:2] # returns a list

[<a class="card" href="/people/deniz-aksoy"><article class="faculty-post"><div class="image"><img alt="Headshot of Deniz Aksoy" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/People/Polisci_Aksoy_D_P1013892.jpg?itok=AvCgVb4x"/><h3><div><span>Deniz</span> </div><div><span>Aksoy</span></div></h3></div><div class="dept">Associate Professor of Political Science</div></article></a>,
 <a class="card" href="/people/zoe-ang"><article class="faculty-post"><div class="image"><img alt="Headshot of Zoe Ang" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/Polisci-Ang_Z_P1014211_0.jpg?itok=mVr8swa4"/><h3><div><span>Zoe</span> </div><div><span>Ang</span></div></h3></div><div class="dept">Lecturer<div class="addtitles">Undergraduate Academic Coordinator </div></div></article></a>]

##### Commonly, you will need to go level by level in an exploratory exercise to access nested tags. Here is an example:

* First get all tags `<div>`

In [57]:
sections = soup.find_all('div') 
len(sections) # check the size of the object

143

* View the FIRST `<a>` tag within the first valid `<div>` tag 

In [59]:
sections[2].a

<a class="first-level" href="/undergraduate">Undergraduate Program</a>

* Or, equivalently:

In [60]:
sections[2].find('a') 

<a class="first-level" href="/undergraduate">Undergraduate Program</a>

* This gives us ALL `<a>` tags within the first valid `<div>` tag 

In [64]:
sections[2].find_all('a')[0:10] 

[<a class="first-level" href="/undergraduate">Undergraduate Program</a>,
 <a class="first-level" href="/graduate-program">Graduate Program</a>,
 <a class="second-level" href="/MA-Statistics">Master’s Degree in Statistics for Political Science Ph.D. Students</a>,
 <a class="first-level" href="/research">Research</a>,
 <a class="first-level" href="/people">Our People</a>,
 <a class="first-level" href="/resources">Resources</a>,
 <a class="first-level" href="https://gifts.wustl.edu/index.html?other_designation_description=Political%20Science">Make a Gift to Political Science</a>,
 <a class="button solid red external" href="https://artsci.wustl.edu/apply" target="_blank">Apply Today<svg class="icon"><use xlink:href="#link-out"></use></svg></a>,
 <a href="/" title="">Home</a>,
 <a href="/course_listing">Courses</a>]

* This gives us ALL `<a>` tags within the first valid `<div>` tag where `class` is 'first-level'

In [66]:
sections[2].find_all('a', {'class' : 'first-level'}) 

[<a class="first-level" href="/undergraduate">Undergraduate Program</a>,
 <a class="first-level" href="/graduate-program">Graduate Program</a>,
 <a class="first-level" href="/research">Research</a>,
 <a class="first-level" href="/people">Our People</a>,
 <a class="first-level" href="/resources">Resources</a>,
 <a class="first-level" href="https://gifts.wustl.edu/index.html?other_designation_description=Political%20Science">Make a Gift to Political Science</a>]

##### We can also create a tree of objects. Here is an example: 

Let's find Prof. Taylor Carlson's profile on the department website. 
1. Find all `<a>` tags where `class` is 'card'

In [151]:
all_people = soup.find_all('a', {'class' : "card"})
all_people[5:7]

[<a class="card" href="/people/daniel-butler"><article class="faculty-post"><div class="image"><img alt="Headshot of Daniel  Butler" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/dmbutler19.jpg?itok=jCyJREOF"/><h3><div><span>Daniel </span> </div><div><span>Butler</span></div></h3></div><div class="dept">Professor of Political Science<div class="addtitles">Director of Undergraduate Studies in Political Science</div></div></article></a>,
 <a class="card" href="/people/taylor-carlson"><article class="faculty-post"><div class="image"><img alt="Headshot of Taylor Carlson" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/PoliSci_Carlson_T_P1322772.jpg?itok=sZD5AZpt"/><h3><div><span>Taylor</span> </div><div><span>Carlson</span></div></h3></div><div class="dept">Associate Professor of Political Science<div class="addtitles">Weidenbaum Center Director of Survey Research</div></div></article></a>]

2. Manually examine where Prof. Carlson is located in the list. 

In [73]:
taylor = all_people[6]
taylor

<a class="card" href="/people/taylor-carlson"><article class="faculty-post"><div class="image"><img alt="Headshot of Taylor Carlson" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/PoliSci_Carlson_T_P1322772.jpg?itok=sZD5AZpt"/><h3><div><span>Taylor</span> </div><div><span>Carlson</span></div></h3></div><div class="dept">Associate Professor of Political Science<div class="addtitles">Weidenbaum Center Director of Survey Research</div></div></article></a>

3. Find the heading that contains Prof. Carlson's first and last name.

In [76]:
taylor.find_all('h3')
# taylor.find('h3').text

[<h3><div><span>Taylor</span> </div><div><span>Carlson</span></div></h3>]

4. Check the contents contained within this `<a>` tag for Prof. Carlson. 
Notice that this is basically the same output as above, but without the `<a></a>` tags. So it is returning everything nested within the 'a' tag.

In [77]:
taylor.contents

[<article class="faculty-post"><div class="image"><img alt="Headshot of Taylor Carlson" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/PoliSci_Carlson_T_P1322772.jpg?itok=sZD5AZpt"/><h3><div><span>Taylor</span> </div><div><span>Carlson</span></div></h3></div><div class="dept">Associate Professor of Political Science<div class="addtitles">Weidenbaum Center Director of Survey Research</div></div></article>]

* This is an iterator
* Remember: iterators are objects that we access with loops

In [78]:
taylor.children

<list_iterator at 0x10b448550>

5. Print all nested elements within 'taylor'

In [79]:
# there is only one child element in this case
for i, child in enumerate(taylor.children):
    print("Child %d: %s" % (i,child), '\n') 

Child 0: <article class="faculty-post"><div class="image"><img alt="Headshot of Taylor Carlson" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/PoliSci_Carlson_T_P1322772.jpg?itok=sZD5AZpt"/><h3><div><span>Taylor</span> </div><div><span>Carlson</span></div></h3></div><div class="dept">Associate Professor of Political Science<div class="addtitles">Weidenbaum Center Director of Survey Research</div></div></article> 



##### Let's now look at sibling tags of `taylor`

In [80]:
# Siblings (Example):

# <html>
#   <body>
#       <a>
#         <b>
#          text1
#         </b>
#         <c>
#          text2
#         </c>
#       </a>
#   </body>
# </html>


# Which two tags are on the same level? 

* See siblings *after* `taylor` in the sequence of `<a>` tags

In [155]:
b = 0
for sib in taylor.next_siblings:
    if b<2:
        print(sib)
        b += 1
    else:
        break

<a class="card" href="/people/david-carter"><article class="faculty-post"><div class="image"><img alt="Headshot of David Carter" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/People/Polisci_Carter_D_P1044474.jpg?itok=s0k5djSC"/><h3><div><span>David</span> </div><div><span>Carter</span></div></h3></div><div class="dept">Professor of Political Science</div></article></a>
<a class="card" href="/people/dino-p-christenson"><article class="faculty-post"><div class="image"><img alt="Headshot of Dino P. Christenson" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/People/dino%20christenson.jpg?itok=gnaOMPPx"/><h3><div><span>Dino P.</span> </div><div><span>Christenson</span></div></h3></div><div class="dept">Professor of Political Science<div class="addtitles">Director of the Environmental Policy Major</div></div></article></a>


* Or see siblings *before* `taylor` in the sequence of `<a>` tags

In [156]:
b = 0
for sib in taylor.previous_siblings:
    if b<2:
        print(sib)
        b += 1
    else:
        break
# What is happening?

<a class="card" href="/people/daniel-butler"><article class="faculty-post"><div class="image"><img alt="Headshot of Daniel  Butler" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/dmbutler19.jpg?itok=jCyJREOF"/><h3><div><span>Daniel </span> </div><div><span>Butler</span></div></h3></div><div class="dept">Professor of Political Science<div class="addtitles">Director of Undergraduate Studies in Political Science</div></div></article></a>
<a class="card" href="/people/christina-l-boyd"><article class="faculty-post"><div class="image"><img alt="Headshot of Christina L. Boyd" src="https://polisci.wustl.edu/files/polisci/styles/testimonial_desktop/public/People/boyd%20headshot.png?itok=SqOYQrUl"/><h3><div><span>Christina L.</span> </div><div><span>Boyd</span></div></h3></div><div class="dept">Professor of Law and Political Science</div></article></a>
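Two things are worth noticing about siblings: `.previous_siblings` walks backwards (nearest sibling first), and sibling iterators can also yield text nodes such as newlines between tags. The sketch below, on an invented four-link snippet, shows `itertools.islice` as an alternative to the manual counter, and `find_next_siblings` / `find_previous_siblings` to keep only tags.

```python
from itertools import islice
from bs4 import BeautifulSoup

# Invented snippet with newlines between the links
html = (
    '<div>'
    '<a class="card" href="/people/a">A</a>\n'
    '<a class="card" href="/people/b">B</a>\n'
    '<a class="card" href="/people/c">C</a>\n'
    '<a class="card" href="/people/d">D</a>'
    '</div>'
)
cards = BeautifulSoup(html, "html.parser").find_all("a", {"class": "card"})
second, last = cards[1], cards[3]

# First two siblings after B: a newline text node, then the <a> for C
for sib in islice(second.next_siblings, 2):
    print(repr(sib))

# Restrict to <a> tags only
next_tags = [s["href"] for s in second.find_next_siblings("a")]
prev_tags = [s["href"] for s in last.find_previous_siblings("a")]  # nearest first
print(next_tags, prev_tags)
```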


#### 1.4 Crawler Detection

##### Crawlers are incredibly fast, but also easier to detect and block. 

You can incorporate some pauses to avoid detection. Strategies include: 

1. Using a random number generator to sleep for a random number of seconds
2. After each iteration, sleep for a fixed number of seconds

##### Import module `random` to generate random numbers, and module `time` to control the pauses in your code

* Random-second pause approach

In [83]:
import random
import time

# Script will pause for n seconds
time.sleep(random.uniform(1, 5))
print('Pause Ended')

Pause Ended


* Fixed-second pause approach

In [84]:
time.sleep(5)
print('done')

done
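* In practice the pause goes inside the loop that visits each page. A minimal sketch, where the helper name `polite_pause` and the URL list are made up, and the demo uses very short pauses so it finishes quickly (use something like 1 to 5 seconds for real scraping):

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=5.0):
    """Sleep for a random duration in the given range and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

# Hypothetical list of pages to visit
urls = ["https://example.com/page1", "https://example.com/page2"]

waits = []
for url in urls:
    # ... fetch and parse `url` here ...
    waits.append(polite_pause(0.01, 0.05))  # tiny values for the demo only

print("Paused", len(waits), "times")
```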


#### 1.5 Remote Drivers

##### Selenium is a “remote driver” of your favorite browser. 

* You can pretty much simulate behavior of a human “surfing the web”. 
* With the right tricks, the likelihood of tracking and blocking your “bot” decreases.
* It also offers flexibility in terms of “unknown” items: you can even look by name of buttons in the page. 

##### There are some downsides though...
  - It is slower
  - It is dependent on your internet connection quality

##### Let's walk through an example using Selenium

* If you haven't already, make sure to install `Selenium` by running
    * `pip install selenium` in terminal or command line, or
    * `!pip install selenium` in a Jupyter notebook cell

* Download the appropriate web driver for your browser, e.g. https://chromedriver.chromium.org/downloads


In [87]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys

1. Give the path to your driver and launch the browser.

In [93]:
# Interactive example:
driver_path = Service('/Users/almavelazquez/Documents/GitHub/PythonCamp2024/Day04/Lecture/chromedriver')
driver = webdriver.Chrome(service = driver_path)

2. Open a web page in the driven browser

In [94]:
driver.get('https://www.google.com')

3. Find the search element and enter text

In [95]:
search = driver.find_element("name", "q")
search.send_keys('WUSTL Political Science')

4. Press Enter / Return (simulate this action using your driver)

In [96]:
search.submit()

5. Close the browser (make sure to always close your browser after web scraping)

In [97]:
driver.quit() # quit() ends the whole session; close() only closes the current window

### 2. Combining Approaches

##### Let's combine the approaches and scrape some data from the Iceland Parliament! 

In [99]:
# Define Webpage
url = "https://www.althingi.is/altext/cv/en/"
# Use a crawler to get all pages for MPs
web_page = urllib.request.urlopen(url)
# Parse the HTML
soup = BeautifulSoup(web_page.read())#, "html.parser") # html.parser serves as a basis for parsing text files in HTML format
# Get all urls
mps = soup.find('table').find_all('a', href = True)
mps[0:10]

[<a href="?nfaerslunr=179">Andrés Ingi Jónsson</a>,
 <a href="?nfaerslunr=223">Arndís Anna Kristínardóttir Gunnarsdóttir</a>,
 <a href="?nfaerslunr=224">Ágúst Bjarni Garðarsson</a>,
 <a href="?nfaerslunr=182">Áslaug Arna Sigurbjörnsdóttir</a>,
 <a href="?nfaerslunr=121">Ásmundur Einar Daðason</a>,
 <a href="?nfaerslunr=167">Ásmundur Friðriksson</a>,
 <a href="?nfaerslunr=245">Ásthildur Lóa Þórsdóttir</a>,
 <a href="?nfaerslunr=225">Berglind Ósk Guðmundsdóttir</a>,
 <a href="?nfaerslunr=213">Bergþór Ólason</a>,
 <a href="?nfaerslunr=36">Birgir Ármannsson</a>]

* Create objects to store the data

In [101]:
page = []
name = []
party = []
email = []

* Run the following process for the first 2 cases

In [102]:
for i in range(0, 2):
  print(i)
  page.append(url + mps[i]['href'])
  driver = webdriver.Chrome(service = driver_path)
  driver.get(page[i])
  html = driver.page_source
  driver.close()
  soup = BeautifulSoup(html)
  name.append(soup.find(class_ = 'article box news').find('h1').text)
  soup = soup.find(class_ = 'article box news').find('div', class_ = 'person')
  party.append(soup.find(class_ = 'office').find_all('li')[1].text)
  email.append(soup.find(class_ = 'contactinfo first notexternal').find('a', href = True)['href'].split(":")[1])
  # time.sleep(5)

0
1


* Examine the outputs

In [103]:
email

['andresingi@althingi.is', 'arndisanna@althingi.is']

In [104]:
party

['Party: Pirate Party', 'Party: Pirate Party']
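* The scraped strings still carry labels such as `Party: `. A small sketch of one way to strip them with `str.partition`; the helper name `strip_label` is made up, and the `mailto:` example is an assumption about what the `href` looks like before the `split(":")` in the loop above.

```python
def strip_label(text, sep=":"):
    """Return the part after the first `sep`, or the whole text if `sep` is absent."""
    _, found, rest = text.partition(sep)
    return rest.strip() if found else text.strip()

party_raw = ['Party: Pirate Party', 'Party: Pirate Party']
party_clean = [strip_label(p) for p in party_raw]
print(party_clean)

# Assumed shape of the email link before cleaning
print(strip_label("mailto:andresingi@althingi.is"))
```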

##### Scraping Tips
- Google Chrome is better to track nodes and page sources
- Inspect the source and get to know your document/website!
- Selenium: use the 'Copy XPath' command if you're having trouble (find it under "Inspect" in Google Chrome)
- Use time breaks to avoid being blocked and be polite
- Check the Terms of Service (whether you obey them or not). Please don't get sued. 


##### More on Selenium: https://selenium-python.readthedocs.io/locating-elements.html

### 3. Reading and Writing Files 

#### 3.1 Reading Files

1. Import libraries

In [105]:
# import sys
import os

2. View your working directory

In [109]:
os.getcwd()

'/Users/almavelazquez/Documents/GitHub/PythonCamp2024/Day04/Lecture'

3. Set your working directory 

In [110]:
os.chdir('/Users/almavelazquez/Documents/GitHub/PythonCamp2024/Day04/Lecture')

4. Read lines from the file

In [111]:
# Read all lines as one string
with open('readfile.txt') as f:
  the_whole_thing = f.read()
  print(the_whole_thing)

Here is line 1.
Here is line 2.
This is the final line.


In [112]:
# Read line by line
with open('readfile.txt') as f:
  lines_list = f.readlines()
  for l in lines_list:
    print(l)

Here is line 1.

Here is line 2.

This is the final line.


In [113]:
# More efficiently, we can loop over the file object (i.e. we don't need the variable lines_list)
with open('readfile.txt') as f:   
  for l in f:
    print(l)

Here is line 1.

Here is line 2.

This is the final line.
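* The blank lines appear because each line read from the file still ends with `\n`, and `print` adds another one. A sketch of one fix, which writes its own small copy of the file to a temporary folder so it runs anywhere:

```python
import os
import tempfile

# Create a throwaway copy of the example file
path = os.path.join(tempfile.mkdtemp(), "readfile.txt")
with open(path, "w") as f:
    f.write("Here is line 1.\nHere is line 2.\nThis is the final line.\n")

# Strip the trailing newline from each line as we read
with open(path) as f:
    lines = [line.rstrip("\n") for line in f]

for line in lines:
    print(line)
```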


In [114]:
# We can also manually open and close files
# I never do this
f =  open('readfile.txt')
print(f.read())
f.close()

Here is line 1.
Here is line 2.
This is the final line.


Tips: 
- Try to minimize the number of times you open and close files
- Opening files is expensive and consumes limited resources (file descriptors); leaving too many files open at once leads to errors 

_Source: https://www.geeksforgeeks.org/context-manager-in-python/_


In [None]:
# file_descriptors = [] 
# for x in range(100000000000): 
#     file_descriptors.append(open('readfile.txt')) 

#### 3.2 Writing Files

1. Writing files is easy, but be careful not to overwrite the content you actually want
2. See https://stackabuse.com/file-handling-in-python/ for more options

* We need to use the option 'w'

In [115]:
with open('test_writefile.txt', 'w') as f:
  ## wipes the file clean and opens it
  f.write("Hi guys.")
  f.write("Does this go on the second line?")
  f.writelines(['a\n', 'b\n', 'c\n'])

In [None]:
# We use 'a' to append new information to it
with open('test_writefile.txt', 'a') as f:
  f.write("I got appended!")
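* To answer the question in the code: `write` does not add a newline for you. The sketch below repeats the same calls on a temporary file and reads the result back, so you can see exactly what ends up on disk.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test_writefile.txt")

with open(path, "w") as f:
    f.write("Hi guys.")
    f.write("Does this go on the second line?")  # no: nothing separates the two writes
    f.writelines(["a\n", "b\n", "c\n"])          # writelines adds no newlines either

with open(path, "a") as f:                       # 'a' keeps the existing content
    f.write("I got appended!")

with open(path) as f:
    content = f.read()

print(repr(content))
```

Add `"\n"` yourself wherever you want a line break.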

##### Writing CSV files (pre-pandas)

1. Import csv

In [116]:
import csv

2. Open a file stream and create a `csv` writer object

In [117]:
# Open a file stream and create a CSV writer object
with open('test_writecsv.csv', 'w', newline='') as f: # newline='' avoids blank rows on Windows (recommended by the csv docs)
  my_writer = csv.writer(f)
  for i in range(1, 100):
    my_writer.writerow([i, i-1])

3. Now read the `csv` file

In [122]:
with open('test_writecsv.csv', 'r') as f:
  my_reader = csv.reader(f)
  mydat = []
  for row in my_reader:
    mydat.append(row)
print(mydat[0],"\n", mydat[1],"\n", mydat[2],"\n", mydat[3])

['1', '0'] 
 ['2', '1'] 
 ['3', '2'] 
 ['4', '3']
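* Notice that `csv.reader` returns every value as a string. If you need numbers, convert them yourself. A short sketch using an in-memory buffer (`io.StringIO`) instead of a file on disk:

```python
import csv
import io

# Write a few rows to an in-memory "file"
buffer = io.StringIO()
writer = csv.writer(buffer)
for i in range(1, 5):
    writer.writerow([i, i - 1])

# Read them back and convert each value from str to int
buffer.seek(0)
rows = [[int(value) for value in row] for row in csv.reader(buffer)]
print(rows)
```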


4. Add column names 

In [124]:
# Note that we are writing a new file
with open('test_csvfields.csv', 'w', newline='') as f: # newline='' avoids blank rows on Windows (recommended by the csv docs)
  my_writer = csv.DictWriter(f, fieldnames = ("A", "B"))
  my_writer.writeheader()
  for i in range(1, 100):
    my_writer.writerow({"B":i, "A":i-1})

5. Read the new file

In [135]:
b = 0
with open('test_csvfields.csv', 'r') as f:
  my_reader = csv.DictReader(f)
  for row in my_reader:
      if b<5:
          print(row)
          b +=1

{'A': '0', 'B': '1'}
{'A': '1', 'B': '2'}
{'A': '2', 'B': '3'}
{'A': '3', 'B': '4'}
{'A': '4', 'B': '5'}


##### Some Tips

- Tip 1: We may find it useful to save webpages (to `.html` files) before collecting data from them

In [22]:
import os

In [137]:
def download_page(address, filename, wait = 5):
  # refuse to overwrite, and skip the request entirely if the file exists
  if os.path.exists(filename):
    print("Can't overwrite file " + filename)
    return
  time.sleep(random.uniform(0,wait)) # be polite: pause before the request
  page = urllib.request.urlopen(address)
  page_content = page.read() # raw bytes
  with open(filename, 'wb') as p_html:
    p_html.write(page_content) # write the bytes as-is; str(bytes) would save a "b'...'" literal

download_page('https://polisci.wustl.edu/people/88/', "polisci_ppl.html")

Then, we can parse a page that is already saved on our computer, even without internet access. 

In [138]:
with open('polisci_ppl.html') as f:
  myfile = f.read()
  soup = BeautifulSoup(myfile)
# soup.prettify()

- Tip 2: You may also write directly from a website to a `csv` file. This is good practice as it ensures a break 10 hours into the process does not erase all of your data. 
- Tip 3: Use Exception Handling techniques that we covered in Day03

In [139]:
with open('iceland_test.csv', 'w') as f: # set up with the writer
  w = csv.DictWriter(f, fieldnames = ("name", "party", "phone")) # define column names
  w.writeheader() # write the header
  web_address='https://www.althingi.is/altext/cv/en/' # the web address
  web_page = urllib.request.urlopen(web_address) # open the web page
  soup = BeautifulSoup(web_page.read()) # soup the web page
  all_members = soup.find_all('tr') # find the list of names and parties
  for i in range(1,3): # for members 1 and 2 (member 0 is just the table heading)
    # you should also add try/except language to ensure a weird item doesn't break your whole scraper
    try:
      member = {} ## empty dictionary to fill in
      member_i = all_members[i].find_all('td') # subset lower to each individual item
      member["name"] = member_i[0].text # member's name
      member['party'] =  member_i[1].text # member's party
      inner_page_url = web_address + member_i[0].a['href'] # get the extension to their personal page
      inner_page = urllib.request.urlopen(inner_page_url) # open the personal page
      inner_soup = BeautifulSoup(inner_page.read()) # soup the personal page
      member['phone'] = inner_soup.find('a', {'class' : 'tel'}).text # get phone number
    except Exception: # a bare except would also swallow KeyboardInterrupt
      member['name'] = 'NA'
      member['party'] = 'NA'
      member['phone'] = 'NA'
    w.writerow(member) # write the row for this specific member
    time.sleep(random.uniform(1, 5)) # be polite, sleep!
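If a long scrape is interrupted, you may want to pick up where you left off instead of starting over. Below is a rough sketch of one way to do that with the `csv` module: append one row at a time, write the header only for a new file, and check which names are already saved. The helper names `append_row` and `already_scraped`, the file name, and the sample rows are all invented for illustration.

```python
import csv
import os
import tempfile

def append_row(path, row, fieldnames):
    """Append one row to a CSV file, writing the header only if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

def already_scraped(path, key):
    """Return the set of values already stored in column `key`."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row[key] for row in csv.DictReader(f)}

# Demo with made-up members in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "members.csv")
fields = ("name", "party")
for member in [{"name": "Member A", "party": "Party X"},
               {"name": "Member B", "party": "Party Y"},
               {"name": "Member A", "party": "Party X"}]:  # duplicate on purpose
    if member["name"] not in already_scraped(path, "name"):
        append_row(path, member, fields)

done = already_scraped(path, "name")
print(sorted(done))
```

Re-reading the file on every iteration is simple but slow for large files; keeping the set in memory and updating it as you go would be a natural refinement.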

In [None]:
# Copyright of the original version:

# Copyright (c) 2014 Matt Dickenson
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.