# Web Scraping II

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
from bs4 import BeautifulSoup
from lxml import html 
from requests import get
import pandas as pd
import numpy as np

## Notes and Key Things to Remember

**Make sure you specify your URL correctly!** This holds for both APIs and web scraping. One of the most common mistakes with these techniques is a URL that hasn't been specified correctly, which causes the request to fail. Steps to take include:
- Check your status code, and if you are not getting a 200 status code, try to figure out what went wrong with your URL.
- Look at the documentation for APIs and see if you can identify the base URL you should be using. Compare the documented base URL to the one you are using to see if they match.
- Navigate to the URL in a browser to see if you can connect to the website and/or see the data.
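
As a small illustration of the first step, the sketch below turns a status code into a hint about what to check next. The helper name `describe_status` is hypothetical (it is not part of `requests`), and the hint messages are only suggestions; the standard phrases come from Python's built-in `http.HTTPStatus`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable summary of an HTTP status code.

    Hypothetical helper for illustration; not part of requests.
    """
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the standard HTTP status codes.
        phrase = "Unknown"
    if 200 <= code < 300:
        return f"{code} {phrase}: the request succeeded"
    if 400 <= code < 500:
        return f"{code} {phrase}: check the URL and any parameters"
    if 500 <= code < 600:
        return f"{code} {phrase}: the server had a problem; try again later"
    return f"{code} {phrase}: inspect the response more closely"


# Typical use in this notebook (requires a network connection):
# response = get(url)
# print(describe_status(response.status_code))
print(describe_status(404))
```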

**Identify the exact pieces of data you want to collect.** A webpage will have lots of information on it, and sifting through it all can be daunting. 
- Use the Selector Gadget to make your life easier. It gives you both an XPath and a CSS selector; the CSS selector is what you pass to the `select` method in Beautiful Soup.
- Try a few different methods to see what you get. Sometimes, just identifying the correct tag might be easy and useful for grabbing the data you want. Other times, the XPath will be the easiest way.

**Cleaning the data is an exercise in Python.** That is, understanding the different Python objects such as lists, dictionaries, and DataFrames is very important. The data that you get will be in all sorts of formats (e.g., list of lists, list of dictionaries, just one big list, etc.), so you will need to think about how to work with data in all shapes and sizes.
- To begin, identify what the data structure is like in the beginning.
- Think about what you want to end up with. This is usually a DataFrame, but what should each row in the DataFrame represent?
- Think about what types of data you have. Do you have numeric data? Or do you have strings? Should you do any conversion while you work with your data?
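
The questions above can be made concrete with a toy case. The sketch below assumes the scraped data arrived as a list of dictionaries whose values are all strings; the records and column names are made up purely for illustration.

```python
import pandas as pd

# Made-up records, shaped like what a scraper often returns:
# a list of dictionaries in which every value is still a string.
records = [
    {"county": "A", "population": "1200", "rate": "4.5"},
    {"county": "B", "population": "860", "rate": "3.9"},
]

# Each dictionary becomes one row; the keys become the column names.
df = pd.DataFrame(records)

# Convert the columns that should be numeric from strings to numbers.
for col in ["population", "rate"]:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

Here each row represents one county, which is the kind of decision worth making before writing any cleaning code.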

## Beautiful Soup

In this notebook, we will continue to work with Beautiful Soup (https://beautiful-soup-4.readthedocs.io/en/latest/), a Python library that is designed to make pulling data out of HTML files easier. 

As a reminder, the steps for web scraping with Beautiful Soup are:
- Get the content using the URL of the website you want to scrape and `get` from `requests`.
- Parse the content of the website using Beautiful Soup and create a Beautiful Soup object. (This creates a data structure that we can navigate to extract information.)
- Look at the source code or use Selector Gadget to identify what you want and where it is within the HTML.
- Use tags and methods such as `find_all` or `select` to get the pieces of information you want from the webpage.
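
The parsing and extraction steps can be sketched offline. In the example below, a small hand-written HTML string stands in for the content that `get(url).content` would return; the page content and the name `soup_demo` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# In the notebook this content would come from get(url).content;
# a small hand-written page stands in for it here.
html_doc = """
<html><body>
  <h1 id="title">Example University</h1>
  <p class="fact">Founded: 1900</p>
  <p class="fact">Mascot: Turtle</p>
  <p>Unrelated paragraph</p>
</body></html>
"""

# Parse the HTML into a navigable Beautiful Soup object.
soup_demo = BeautifulSoup(html_doc, "html.parser")

# find_all matches by tag name (optionally filtered by attributes).
all_paragraphs = soup_demo.find_all("p")

# select matches by CSS selector, the kind Selector Gadget produces.
facts = soup_demo.select("p.fact")

print(len(all_paragraphs))
print([p.get_text() for p in facts])
```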


## Beautiful Soup with Selector Gadget

So far, we've gone over using the XPath from Selector Gadget, as well as using Beautiful Soup and trying to identify the tags that are associated with each piece of data we want to collect. We can also use these together. The Selector Gadget tool gives us what we need to use within the `select` method to grab just the highlighted information.

Let's give this a shot with some data from Wikipedia. The first step, as before, is to get the URL of the website we want to scrape. We will scrape data from the University of Maryland Wikipedia page. This page has some basic information about the university, including tables of student composition and admittance information.

In [None]:
url = 'https://en.wikipedia.org/wiki/University_of_Maryland,_College_Park'

We can navigate to the URL to take a look at it. Try opening up Selector Gadget with the website.

In [None]:
url

As before, we use `get` to get a response from the website. As long as you don't have a typo in the URL, you should get a 200 status code.

In [None]:
webpage = get(url)
webpage.status_code

Then, we use `BeautifulSoup` to get the webpage content into the Beautiful Soup data structure. This provides the organization that we can then use to extract the information we want.

In [None]:
soup = BeautifulSoup(webpage.content, 'html.parser')

Once we have the parsed HTML information, we can use the `.select` method with our Beautiful Soup object to grab the pieces that we want using the help of Selector Gadget. For example, to grab some information from tables within the Wikipedia article, we can click around until we have the information we want highlighted, then paste the string that shows up at the bottom (NOT the XPath!) as the argument. 
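
To see how a comma-separated selector behaves before trying it on the real page, here is a minimal offline sketch. The two toy tables and the name `demo` are made up for illustration; only the selector string mirrors the kind of string Selector Gadget produces.

```python
from bs4 import BeautifulSoup

# Two toy tables: only the first has the class we will select on.
table_html = """
<table class="wikitable">
  <tr><th>Year</th><th>Applicants</th></tr>
  <tr><td>2001</td><td>100</td></tr>
</table>
<table class="other">
  <tr><td>ignored</td></tr>
</table>
"""

demo = BeautifulSoup(table_html, "html.parser")

# A comma separates alternative selectors: match td OR th cells
# that sit inside an element with class "wikitable".
cells = demo.select(".wikitable td, .wikitable th")

# Matches come back in the order they appear in the document.
print([cell.name for cell in cells])
print([cell.get_text() for cell in cells])
```

Because the selector matches every element with the `wikitable` class, a page with several such tables returns cells from all of them, which is why some filtering is usually needed afterwards.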

Let's try to grab the table of admissions information from the Wikipedia page. This should look something like the following:

![UMD Admissions Table](umd_table.png)

To grab the information from this table, we'll use Selector Gadget to try to select just the information in the table, then use that with the `select` method to hopefully isolate the data we want. Note that it might be very hard to grab only the piece we want, so we might have to deal with getting a bit more information. 

In the image below you can see the table highlighted. The stuff at the bottom in the Selector Gadget tool that says `".wikitable td, .wikitable th"` is what we want to copy and paste and use with the `select` method.

![Selector Gadget with the UMD Admissions Table](SelectorGadgetWikipedia.png)

In [None]:
table_info = soup.select('.wikitable td, .wikitable th')
table_info[:10]

Note that we have some stuff that we don't necessarily want to include.

In [None]:
table_info[-10:]

The information we wanted to get is in a table, so we'll eventually want to get it into a DataFrame format. With this in mind, we'll have to do some work to get the text that's in the tables in a format that is conducive to analysis.

<font color ='red'>**Question 1: Create a list that contains the text of just the table information for University of Maryland applicants (so, not the table information for any other tables that were brought in). Use the `.strip` method to remove any leading or trailing spaces and/or carriage returns.**</font>

Now that we have the information in a list, we need to do some transformation in order to make it into a DataFrame. 

<font color ='red'>**Question 2: Using list comprehension and dictionary comprehension, create a dictionary that has the variable names as keys (Applicants, Admits, Admit rate, etc.) and the values for each year between 2017 and 2022 as the values. Note that you won't have a "Year" key since they didn't include "Year" within the table originally. Using this dictionary, create a DataFrame called `umd_data`.**</font>

<font color ='red'>**Question 3: Using `apply`, remove all commas from the strings of numbers that have commas. For example, if a value is "56,766", it should be updated so that it is just "56766". Then, convert all numeric variables into numeric.**</font>

Let's do a little bit more cleaning by renaming the blank column to `Year`.

In [None]:
umd_data.rename(columns = {'':'Year'}, inplace = True)
umd_data.head()

<font color ='red'>**Question 4: Draw a line graph plotting the Admittance rate and Yield rate over the years.**</font>

*Hint:* You can use `.plot.line` with the argument `x=` for specifying the x-axis.

<font color ='red'>**Question 5: Create two new variables: `sat_lower` and `sat_upper`, which represent the lower and upper bounds for the middle 50% of students who submitted SAT scores. Make sure that these variables are numeric (rather than string). Create a line plot of these.**</font>

## Extracting Other Information

If we want to grab the links in a webpage, we can do that as well. This is done by identifying where the link is, then accessing the `href` content. For example, let's search for all `a` tags and look for any that have an `href` attribute.

In [None]:
soup.find_all('a')[:6]

In [None]:
soup.find_all('a')[6]

Once we've isolated it, we can simply use `['href']` in order to get that content. Note that these are usually relative links within Wikipedia, so you'll have to do some more work to get a usable URL.

In [None]:
soup.find_all('a')[6]['href']

> Hint: This is something that might be useful for grabbing lots of links and looping through them. For example, if you need to loop through lots of counties and navigate to their pages, you can get a list of URLs by grabbing the href content.
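
A minimal sketch of that idea on toy HTML follows. The link paths and the name `demo` are made up for illustration; `href=True` is a Beautiful Soup filter that keeps only tags that have an `href` attribute.

```python
from bs4 import BeautifulSoup

# Toy list of links; the paths are invented for this example.
links_html = """
<ul>
  <li><a href="/wiki/Page_One">Page one</a></li>
  <li><a href="/wiki/Page_Two">Page two</a></li>
  <li><a name="no-href">Anchor without a link</a></li>
</ul>
"""

demo = BeautifulSoup(links_html, "html.parser")

# href=True keeps only the <a> tags that actually have an href,
# so a["href"] cannot raise a KeyError.
hrefs = [a["href"] for a in demo.find_all("a", href=True)]

# Prepend the site's base URL to turn relative paths into full URLs.
base = "https://en.wikipedia.org"
full_urls = [base + href for href in hrefs]

for link in full_urls:
    print(link)
```

The resulting list can then be looped over, calling `get` on each URL in turn.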

## Navigating the Tree

Let's go back to the selection we made earlier with `table_info`. Note that there is a link, but because it is nested inside the table cell's formatting, we can't access it directly.

In [None]:
table_info[59]

So, we need to go one level deeper and navigate the tree in order to access it. We'll use `find` to grab the `a` tag, then use that to grab the URL information.

In [None]:
table_info[59].find('a')

In [None]:
table_info[59].find('a')['href']

Note that again, this is a relative path, so we need to add the Wikipedia URL to make it complete.

In [None]:
'https://en.wikipedia.org' + table_info[59].find('a')['href']
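
String concatenation works when every link starts with `/`. As an alternative sketch, `urljoin` from the standard library's `urllib.parse` resolves a link against the page it was found on and also handles other link forms. The `href` values below are hypothetical examples, not values taken from the page.

```python
from urllib.parse import urljoin

# The page the links were scraped from.
page_url = "https://en.wikipedia.org/wiki/University_of_Maryland,_College_Park"

# Site-relative path: resolved against the scheme and host of page_url.
print(urljoin(page_url, "/wiki/SAT"))

# Protocol-relative link: only the scheme is borrowed from page_url.
print(urljoin(page_url, "//upload.wikimedia.org/example.png"))

# Fragment: points to a section within the same page.
print(urljoin(page_url, "#History"))

# Already-absolute URL: returned unchanged.
print(urljoin(page_url, "https://example.org/data"))
```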

## More Scraping 

Let's do another example of scraping from the University of Maryland Wikipedia page. Let's grab the information that is on the side box containing some quick facts about the university. 

<img src="umd_facts.png" width="200">

<font color ='red'>**Question 6: Using `select`, grab the information in the box (part of the box is shown above). This should have stuff like "Former names", "Motto", "Motto in English", and so on.**</font>

<font color ='red'>**Question 7: Create a dictionary that contains the information, with the keys representing the left-hand column ("Former names", "Motto", etc.) and the values representing the values for Maryland.**</font>

<font color ='red'>**Question 8: Extract the link to the page for "Public land-grant research university" from the box.**</font>