# Introduction to Webscraping, 2-12-19, CRDDS Workshop Series
## Presented by Kevan Feshami & Phil White

### The following Jupyter Notebook was the basis for the Python/Beautiful Soup demonstration.

Contact Phil White with questions: [philip.white@colorado.edu](mailto:philip.white@colorado.edu)

Dependencies: if you're using Python, you will need to ensure requests, bs4, lxml, and pandas are installed. csv and datetime are part of the standard library. The easiest way to install these packages is to open your command prompt and run 'pip install requests', 'pip install bs4', 'pip install lxml', and 'pip install pandas'.
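If you are unsure whether everything is installed, a quick check along these lines may help. This is a minimal sketch, not part of the original workshop: the helper name find_missing is made up for illustration, and lxml and pandas are included in the list because later cells in this notebook use them.

```python
import importlib.util

# Top-level module names used in this notebook; lxml is the parser
# passed to BeautifulSoup later on, pandas is used in the bonus step.
required = ["requests", "bs4", "lxml", "pandas", "csv", "datetime"]

def find_missing(modules):
    """Return the module names that cannot be found in this environment (hypothetical helper)."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages found.")
```

Anything listed as missing can be installed with pip before running the rest of the notebook.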

Press Shift+Enter in each code block (cell) to execute it.

In [1]:
#This cell imports bs4, requests, csv, datetime, and pandas

from bs4 import BeautifulSoup as bs #used to parse html
import requests #used to call urls
import csv #reads and writes csvs
from datetime import datetime #makes a timestamp
import pandas as pd #used to view the output csv as a dataframe

### Basic workflow

1. Use requests.get() to call a url, and append the .text to the command to bring in the html of the page.

2. bs(source, 'lxml') tells BeautifulSoup to read/parse the html

3. Identify the tags you want to extract data from. It is helpful to go to the page and use right-click inspect element to figure out where the item you want is nested in the HTML structure.

4. In this example, soup.h2.text grabs the text inside the first h2 tag on the page.

5. soup.h2.a['href'] grabs the hyperlink (the href attribute) from the first anchor tag inside that h2.

6. We then create a variable from each line of code, naming them 'source' (the page's html), 'soup' (the parsed document that BeautifulSoup can search), 'headline' (first h2 tag text), and 'link' (first link within the first h2 tag).

7. Finally, we print them both out.

In this example, we're only grabbing the first h2 tag and the first a tag inside it.
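To try the steps above without hitting a live site, here is a small offline sketch. The HTML string and example.com links are invented stand-ins for what requests.get(url).text would return, and it uses Python's built-in 'html.parser' instead of 'lxml' so that no extra install is assumed.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical HTML standing in for the text returned by requests.get(url).text
source = """
<html><body>
  <h2 class="title"><a href="https://example.com/first">First headline</a></h2>
  <h2 class="title"><a href="https://example.com/second">Second headline</a></h2>
</body></html>
"""

soup = bs(source, "html.parser")  # built-in parser, swapped in for 'lxml' here
headline = soup.h2.text           # text of the first h2 only
link = soup.h2.a["href"]          # href of the first anchor inside that h2

print(headline)
print(link)
```

Only the first headline and link are printed, even though the page contains two h2 tags.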

In [2]:
source = requests.get('https://www.npr.org/sections/politics/').text #get the html

In [3]:
soup = bs(source, 'lxml') #parse the html into an object that BeautifulSoup can search

In [4]:
headline = soup.h2.text #grab the text within the first h2 tag

In [5]:
link = soup.h2.a['href'] #grab the link within the first h2 tag

In [6]:
#print them both
print(headline)
print(link)

Wisconsin Election Held Amid Virus Fears: Here's What You Need To Know
https://www.npr.org/2020/04/07/828055678/wisconsin-election-held-amid-virus-fears-heres-what-you-need-to-know
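Live requests can fail (bad url, timeout, server error), so it can be worth wrapping the call. The following is one possible sketch rather than the workshop's method: fetch_html is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page html, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raise an error for 4xx/5xx responses
    except requests.RequestException as error:
        print(f"Could not fetch {url}: {error}")
        return None
    return response.text
```

A call such as fetch_html('https://www.npr.org/sections/politics/') could then replace requests.get(...).text, with a check for None before parsing.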


### Next, instead of grabbing the first h2 tag, we use 'find_all' to grab all of the h2 tags.

1. find_all gets all h2 tags. In this example, we filtered by class using class_ = 'title'. This got rid of h2 tags that contained info we don't want. We stored the result in a variable called 'headlines'.
2. Then, list() is used to create an empty list, which we store in a variable called linkList.
3. Next, we create a 'for loop' to iterate over each item in the headlines list. Within the loop, it takes each h2 and grabs the title and the link (same general structure as above).
4. Finally, the loop uses an append command to add each link to linkList.

In [7]:
headlines = soup.find_all('h2', class_ = 'title') #grabs every h2 tag classed as 'title'

In [8]:
linkList = list() #creates an empty list

In [9]:
for items in headlines:
    titles = items.text
    links = items.a['href']
    #print (titles)
    linkList.append(links)
    
#this loop iterates over each h2 in the headlines list, plucks the headline text and link, then adds each link to linkList.
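The same find_all loop can be tried offline. In this sketch the HTML, class names other than 'title', and example.com links are made up for illustration, and 'html.parser' is used in place of 'lxml'. It also skips any h2 that has no link inside it, since items.a would be None there and indexing it would raise a TypeError.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical HTML: two 'title' headlines with links, one h2 with a
# different class, and one 'title' h2 that has no link inside it.
source = """
<h2 class="title"><a href="https://example.com/a">Story A</a></h2>
<h2 class="promo"><a href="https://example.com/ad">Not a story</a></h2>
<h2 class="title"><a href="https://example.com/b">Story B</a></h2>
<h2 class="title">Headline without a link</h2>
"""

soup = bs(source, "html.parser")
headlines = soup.find_all("h2", class_="title")  # the 'promo' h2 is filtered out

linkList = []
for item in headlines:
    if item.a is None:  # no anchor inside this h2, so there is no href to read
        continue
    linkList.append(item.a["href"])

print(linkList)
```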

### Next, we take all of the links harvested in the previous step, and scrape data from each individual page.

The 'with' statement opens the output csv, creating it if it doesn't exist. You'll need to modify the path to place the output csv into your own file directory. 'a' opens the file in append mode, so each new row is added to the end of the csv. 'as f' assigns the open file to the variable f.

The next line uses the csv.writer function to make a new variable that writes new rows to f, our file.

writer.writerow is used to write data to the cells in a row. The values for each column go between brackets, separated by commas. To write free text, just add single quotes around it.

Finally, the loop runs through the same workflow as above: First it grabs the url, then bs4 reads it, then we take the h1 tag from each of the pages and write them to a csv. 

Bonus: I added in a datetime function to time stamp each headline with a collected date, and wrote that as an additional column to the output csv.

Note: Because the open command uses 'a' for append, if you run this again it will just add new lines (including another header row) to your file. 'w' in place of 'a' will overwrite the file each time.
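If you want to keep appending across runs without repeating the header row, one option is to write the header only when the file is new or empty. This is a sketch under assumptions: append_rows is a hypothetical helper, and the temporary folder and sample titles are placeholders for your own path and scraped headlines.

```python
import csv
import os
import tempfile
from datetime import datetime

def append_rows(path, titles):
    """Append one row per title; write the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if needs_header:
            writer.writerow(["Article Title", "Date Collected"])
        for title in titles:
            writer.writerow([title, str(datetime.today())])

# Hypothetical output location and titles, for illustration only
path = os.path.join(tempfile.mkdtemp(), "output.csv")
append_rows(path, ["First headline"])
append_rows(path, ["Second headline"])  # second run: header is not repeated

with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```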

In [10]:
with open('C:\\Users\\phwh9568\\Workshops\\WebScrape\\output.csv', 'a', newline = '', encoding = 'utf-8') as f: #opens an output csv
    writer = csv.writer(f) #creates a writing function
    writer.writerow(['Article Title', 'Date Collected']) #writes headers into the first row of our output csv
    for links in linkList: #iterates through each link in the linkList created above
        sources = requests.get(links).text #grabs the html behind each link
        soups = bs(sources, 'lxml') #BeautifulSoup reads each one
        articleTitles = soups.h1.text #grabs h1 text for each page
        today = str(datetime.today()) #Creates a time stamp for each run through this iterator
        writer.writerow([articleTitles, today]) #writes the h1 text for each page to a new row and adds a timestamp on each.
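Not every page is guaranteed to have an h1 tag, and soups.h1.text raises an AttributeError when soups.h1 is None. A small guard like the one below is one way to handle that. It is a sketch, not part of the original workshop: get_article_title is a hypothetical helper, and it uses 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup as bs

def get_article_title(html):
    """Return the stripped text of the first h1, or None if the page has no h1 (hypothetical helper)."""
    h1 = bs(html, "html.parser").h1
    return h1.get_text(strip=True) if h1 is not None else None

print(get_article_title("<h1>  A headline  </h1>"))
print(get_article_title("<p>No heading here</p>"))
```

Inside the loop, a None result could be skipped or written as an empty cell, depending on what you want in the csv.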

#### Bonus!

Import your csv to a Pandas dataframe so you can take a look at it.

In [11]:
df = pd.read_csv('C:\\Users\\phwh9568\\Workshops\\WebScrape\\output.csv', encoding = 'utf-8') #tell pandas to read the csv, using the same encoding it was written with

#### Type df into the next cell and view a pretty version of your data.

In [12]:
df

Unnamed: 0,Article Title,Date Collected
0,Trump Says He's Not 'Happy' With Budget Deal B...,2019-02-12 12:55:07.943443
1,Trump's 'Socialism' Attack On Democrats Has It...,2019-02-12 12:55:08.303651
2,Former Attorney General Eric Holder Close To 2...,2019-02-12 12:55:08.650661
3,Trump Took Fight For Border Wall To El Paso ...,2019-02-12 12:55:08.823678
4,Trump Supporter Violently Shoves BBC Cameraman...,2019-02-12 12:55:09.096291
5,'Agreement In Principle' Reached On Border Sec...,2019-02-12 12:55:09.417298
6,If Trump Declares An Emergency To Build The Wa...,2019-02-12 12:55:09.764906
7,Rep. Ilhan Omar Apologizes 'Unequivocally' For...,2019-02-12 12:55:09.952110
8,"Days From Another Shutdown, Here's What The Ne...",2019-02-12 12:55:10.233329
9,ICE Detention Beds New Stumbling Block In Effo...,2019-02-12 12:55:10.576536
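A note on encodings: the csv was written with encoding = 'utf-8', and reading it back with a different encoding can garble non-ASCII characters such as accented letters or long dashes. The sketch below shows the effect with plain Python strings. The sample title is made up for illustration.

```python
# A headline with a non-ASCII character (hypothetical example text)
title = "Café owners react"

raw = title.encode("utf-8")       # the bytes as written with encoding='utf-8'
wrong = raw.decode("ISO-8859-1")  # decoded with a mismatched encoding
right = raw.decode("utf-8")       # decoded with the matching encoding

print(wrong)
print(right)
```

The first line prints a garbled version of the title, while the second matches the original, which is why the read and write encodings should agree.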


# Ta da!!