# Introduction to Webscraping, 4-9-20, CRDDS Workshop Series
## Presented by Nickoal Eichmann-Kalwara & Phil White

### The following Jupyter Notebook was the basis for the Python/Beautiful Soup demonstration.

Contact Phil White with questions: [philip.white@colorado.edu](mailto:philip.white@colorado.edu)

Dependencies: if you're using Python, you will need to ensure requests, bs4, lxml (the parser used below), and pandas are installed. csv and datetime come standard. The easiest way to install these packages is to open your command prompt and type 'pip install bs4', 'pip install requests', 'pip install lxml', and 'pip install pandas'.

#### In this example, we'll scrape the html of today's NPR Politics section. View the source of the [NPR Politics page here](https://www.npr.org/sections/politics/)

### Basic workflow

1. Use requests.get() to request a url, and append .text to the call to bring in the html of the page as a string.

2. bs(source, 'lxml') tells BeautifulSoup to read/parse the html

3. Identify the tags you want to extract data from. It is helpful to go to the page and use right-click inspect element to figure out where the item you want is nested in the HTML structure.

4. In this example, soup.h2.text grabs the text of the first h2 tag on the page.

5. soup.h2.a['href'] grabs the hyperlink (the href attribute) of the first anchor tag within that h2.

6. We then create variables out of each line of code, naming them 'source' (the source data), 'soup' (the parsed document that BeautifulSoup can search), and then 'headline' (first h2 tag text) and 'link' (first link within the first h2 tag).

7. Finally, we print them both out.

    '>>>'  represents the Python prompt. 

Type all the code you find after prompts in the next cell and execute with a shift+enter

#### Import necessary Python libraries:
We'll use BeautifulSoup to parse html, requests to retrieve webpages, csv to write output files, datetime to timestamp our data, and pandas to view our data table.

Type:

    >>> from bs4 import BeautifulSoup as bs
    >>> import requests
    >>> import csv
    >>> from datetime import datetime
    >>> import pandas as pd

In this example, we're only grabbing the first h2 and a tag from the html doc.

Start by requesting the webpage:

    >>> source = requests.get('https://www.npr.org/sections/politics/').text

Parse it into an object that BeautifulSoup can search:

    >>> soup = bs(source, 'lxml')

Grab the text within the first h2 tag:

    >>> headline = soup.h2.text

Grab the link within the first h2 tag:

    >>> link = soup.h2.a['href']

Print them both:

    >>> print(headline)
    >>> print(link)
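
To see what these lookups return without depending on the live NPR page (whose markup can change at any time), the same steps can be tried on a small hand-written snippet. This is only a minimal sketch: the HTML string and its example.com URLs are invented for illustration, and it uses Python's built-in 'html.parser' in place of 'lxml' so that it runs even if lxml is not installed.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical HTML standing in for a downloaded page (not real NPR markup).
sample_html = """
<html><body>
  <h2 class="title"><a href="https://example.com/story-one">Story One</a></h2>
  <h2 class="title"><a href="https://example.com/story-two">Story Two</a></h2>
</body></html>
"""

# Parse the string; 'html.parser' ships with Python (an assumption made here
# so the sketch has no extra dependency beyond bs4).
soup = bs(sample_html, 'html.parser')

# soup.h2 is the FIRST h2 tag in the document (or None if there is no h2).
first_h2 = soup.h2
headline = first_h2.text.strip()  # text inside the first h2
link = first_h2.a['href']         # href of the first anchor inside that h2

print(headline)
print(link)
```

Only the first headline and link come back, even though the snippet contains two h2 tags. That is why the next step switches to find_all.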

### Next, instead of grabbing the first h2 tag, we use 'find_all' to grab all of the h2 tags.

1. find_all gets all h2 tags. In this example, we filter by class using class_ = 'title'. This gets rid of h2 tags that contain info we don't want. We make a variable out of this called 'headlines'.
2. Then, list() is used to create an empty list. We make the list a variable called linkList.
3. Next, we create a 'for loop' to iterate over each item in the headlines list. Within the loop, it takes each h2 and grabs the title and the link (same general structure as above).
4. Finally, the loop uses an append command to add each link to linkList.

First, grab each h2 tag classed as 'title':

    >>> headlines = soup.find_all('h2', class_ = 'title')

Next, we'll make a list of page links that we can revisit. Start by making an empty list:

    >>> linkList = list()

Now, we'll create a for loop. The loop iterates over each h2 in the headlines list and plucks the title headline text and link, then adds each link to the linkList list element:

    >>> for items in headlines:
            titles = items.text
            links = items.a['href']
            linkList.append(links)
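
The loop above assumes every matched h2 contains an anchor tag. If one does not, items.a is None and items.a['href'] raises a TypeError. A guarded version might look like the sketch below. The HTML is invented for illustration (it is not NPR's markup), and 'html.parser' is used instead of 'lxml' only so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup: two linked headlines, one unwanted h2,
# and one 'title' h2 that has no link at all.
sample_html = """
<h2 class="title"><a href="https://example.com/a">First headline</a></h2>
<h2 class="promo">Not a headline we want</h2>
<h2 class="title">Headline without a link</h2>
<h2 class="title"><a href="https://example.com/b">Second headline</a></h2>
"""

soup = bs(sample_html, 'html.parser')

# Keep only h2 tags whose class is 'title'.
headlines = soup.find_all('h2', class_='title')

linkList = []
for item in headlines:
    anchor = item.a
    # Skip h2 tags with no anchor, or an anchor with no href attribute.
    if anchor is None or not anchor.has_attr('href'):
        continue
    linkList.append(anchor['href'])

print(len(headlines))
print(linkList)
```

The class filter drops the 'promo' h2, and the guard drops the headline that has no link, so linkList ends up with only the two usable URLs.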

### Next, we take all of the links harvested in the previous step, and scrape data from each individual page.

The 'with' command opens up a new csv. You'll need to modify the path to place the output csv into your own file directory. 'a' opens the file in append mode, so each new row is added to the end of the csv. 'as f' creates a variable out of the open file.

The next line uses the csv.writer function to make a new variable that writes new rows to f, our file.

writer.writerow is used to write data to the cells in a row. "Columns" in the row go between brackets and each column is separated by a comma. To write free text, just add single quotes around it.

Finally, the loop runs through the same workflow as above: First it grabs the url, then bs4 reads it, then we take the h1 tag from each of the pages and write them to a csv. 

Bonus: I added in a datetime function to time stamp each headline with a collected date, and wrote that as an additional column to the output csv.

Note: Because the open command uses 'a' for append, if you run this again it will just add new lines (including another header row) to your file. 'w' in place of 'a' will overwrite the file each time.

    >>> with open('C:\\Users\\phwh9568\\Workshops\\WebScrape\\output.csv', 'a', newline = '', encoding = 'utf-8') as f: 
        writer = csv.writer(f) 
        writer.writerow(['Article Title', 'Date Collected']) 
        for links in linkList: 
            sources = requests.get(links).text 
            soups = bs(sources, 'lxml') 
            articleTitles = soups.h1.text 
            today = str(datetime.today()) 
            writer.writerow([articleTitles, today]) 

#### Here is the above code with comments to help understand what each line is accomplishing

    with open('C:\\Users\\phwh9568\\Workshops\\WebScrape\\output.csv', 'a', newline = '', encoding = 'utf-8') as f: #opens an output csv
        writer = csv.writer(f) #creates a writing function
        writer.writerow(['Article Title', 'Date Collected']) #writes headers into the first row of our output csv
        for links in linkList: #iterates through each link in the linkList created above
            sources = requests.get(links).text #grabs the html behind each link
            soups = bs(sources, 'lxml') #BeautifulSoup reads each one
            articleTitles = soups.h1.text #grabs h1 text for each page
            today = str(datetime.today()) #Creates a time stamp for each run through this iterator
            writer.writerow([articleTitles, today]) #writes the h1 text for each page to a new row and adds a timestamp on each.
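
Because the file is opened in append mode, running the code more than once writes the header row again each time. One possible way around that is to write the header only when the file is new or empty. The sketch below is an illustration rather than part of the workshop code: the helper name append_rows is hypothetical, it writes to a temporary folder instead of the path used above, and it takes a list of titles directly so that it runs without any web requests.

```python
import csv
import os
import tempfile
from datetime import datetime


def append_rows(path, titles):
    """Append one row per title, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(['Article Title', 'Date Collected'])
        for title in titles:
            # ISO-formatted timestamp, trimmed to whole seconds.
            stamp = datetime.today().isoformat(timespec='seconds')
            writer.writerow([title, stamp])


# Demo: call the helper twice against the same file in a temporary folder.
out_path = os.path.join(tempfile.mkdtemp(), 'output.csv')
append_rows(out_path, ['First title'])
append_rows(out_path, ['Second title'])

# Read the file back to check what was written.
with open(out_path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))

print(rows[0])
print(len(rows))
```

After two calls the file holds a single header row followed by one row per title, rather than a repeated header in the middle of the data.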

#### Bonus!

Import your csv to a Pandas dataframe so you can take a look at it.

    >>> df = pd.read_csv('C:\\Users\\phwh9568\\Workshops\\WebScrape\\output.csv', encoding = 'utf-8')

Type df into the next cell and view a pretty version of your data.

    >>> df
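
If you want pandas to treat the 'Date Collected' column as real timestamps rather than plain text, read_csv accepts a parse_dates argument. The sketch below is an optional extra, not part of the workshop code: it reads from an in-memory string with made-up rows so it runs without the output file.

```python
import io

import pandas as pd

# Made-up rows in the same two-column layout as the output csv.
csv_text = (
    "Article Title,Date Collected\n"
    "First title,2020-04-09 10:15:00\n"
    "Second title,2020-04-09 10:15:02\n"
)

# parse_dates converts the named column to datetime values while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date Collected'])

print(df.shape)
print(df.dtypes)
```

With the column parsed as dates, you can sort or filter by collection time instead of comparing strings.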

# Ta da!!