# Web Scraping

- What is web scraping
- HTML basics
- Beautiful Soup

### What is Web-Scraping?
Web scraping is the process of using scripts or bots to extract content and data from a website.

This is especially useful for gathering data from the internet when web APIs aren't available.
- An API, or Application Programming Interface, is an interface that defines how different pieces of software interact with each other. For example, Google Maps has an API, so programs can use its data without having to web-scrape.

Web scraping primarily works by 'parsing' HTML code.

HTML (HyperText Markup Language) is how web pages are encoded. It allows the information on a page to be displayed on any size screen when it is read and decoded by a web browser.

In order for us to web-scrape, we first need to understand some of the basics of HTML.

### HTML in a Nutshell
Let's start by looking at a basic example:

<img src="html_basic_example.png">

Each part of a page is defined by a 'tag'. Most tags come in pairs: the opening tag has the form  < _ > and the closing tag has the form </ _ >. A few tags, such as the 'img' tag above, stand alone and have no closing tag.

There are many different kinds of tags, with many different meanings. Here are some of the common ones you'll run into:

| Tag  | Use |
|:---:|:---|
| < html > | The root of the page, gives the browser a starting point |
| < head > | Contains elements describing the document |
| < body > | Defines the main body of the page; within it will be all the images, tables, lists, paragraphs, etc. |
| < p > | Paragraph, contains standard text |
| < li > | Defines list elements |
| < a > | Stands for anchor, contains a link to another page. Takes the form < a href="..." > link text < /a > |
| < div > | Divides sections in a page, used as a container for HTML elements |
| < tr > | Holds a row of data in a table, used as a container for td elements |
| < td > | Holds data in a "cell" within an HTML table |
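
To make the idea of paired, nested tags concrete, here is a minimal sketch that uses only Python's built-in `html.parser` module. The page in `SAMPLE_HTML` and the `TagOutliner` class are made up for illustration; they are not part of the IMDB example used later.

```python
from html.parser import HTMLParser

# A small made-up page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p>Hello, <a href="https://example.com">a link</a>.</p>
  </body>
</html>
"""


class TagOutliner(HTMLParser):
    """Records every opening and closing tag, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag, e.g. <p>
        self.outline.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # Called once for each closing tag, e.g. </p>
        self.depth -= 1
        self.outline.append("  " * self.depth + f"</{tag}>")


outliner = TagOutliner()
outliner.feed(SAMPLE_HTML)
print("\n".join(outliner.outline))
```

The printed outline shows each opening tag matched by a closing tag at the same indentation, which is the same nesting you see in the browser's "inspect" view.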


Pages will have many tags nested in one another, and it can be intimidating to navigate at first. Fortunately, web browsers make it easy to study the HTML. Open any website in another tab, highlight anything you want to see the HTML source of, right-click, and select "inspect" or "inspect element".

This will bring up the HTML source the page is using. Running your mouse over the code will highlight what it produces on the screen.

Elements often have 'attributes':

<img src="html_attribute_example.png">

The 'a' tag tells us the image of the Wikipedia symbol is actually a link to the main page, with the attributes 'title', 'class', and 'href' (hypertext reference, which gives the URL the image links to).

We will use these attributes to search for elements within the HTML using a web-scraping package called 'Beautiful Soup'.
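
As a preview of how attributes look from Python, the sketch below parses a single 'a' tag and reads its attributes. The snippet is a made-up stand-in for the link in the screenshot (its attribute values are assumptions), and `demo_soup` and `link` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on a logo link; the values are made up.
snippet = '<a class="logo" href="/wiki/Main_Page" title="Visit the main page">Home</a>'

demo_soup = BeautifulSoup(snippet, "html.parser")  # parser bundled with Python
link = demo_soup.a                                 # first <a> tag in the snippet

print(link.attrs)      # all attributes as a dictionary
print(link["href"])    # one attribute, looked up like a dictionary key
print(link["class"])   # 'class' can hold several values, so it comes back as a list
print(link.get("id"))  # .get() returns None when the attribute is missing
```

Using `.get()` instead of square brackets is a safer choice when you are not sure every tag carries the attribute, since square brackets raise a `KeyError` for a missing one.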

### Beautiful Soup

'Beautiful Soup' is a Python package for web scraping, just as 'Pandas' is a package for data analysis.

First, let's open and look at the IMDB Top 250 Movies: https://www.imdb.com/chart/top/

Below is the code that will extract the title of each listed movie, along with its release year and rating. This information will be stored in a Pandas data frame.

Let's look through the code and see what each part does:

In [1]:
import requests  # package that makes requests over the internet to get the HTML code from desired URLs
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# first retrieve the page from IMDB
result = requests.get("https://www.imdb.com/chart/top/")
print('status code', result.status_code)  # 200 means the request was successful, 404 would mean 'not found'
src = result.content  # the raw HTML source that was delivered

# src is raw HTML. We pass it to BeautifulSoup, which parses it into a new data structure,
# similar to how we store 'csv' information in a data frame.
# The 'soup' variable is easier to analyze, since we can use the functions the package provides on it.
soup = BeautifulSoup(src, 'lxml')  # 'lxml' names the parser to use (the lxml package must be installed)

status code 200
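
Status codes other than 200 and 404 also show up when scraping. The sketch below uses Python's built-in `http.HTTPStatus` to turn a code into a readable phrase; `describe_status` is a hypothetical helper written for this illustration, not part of 'requests'.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about
        phrase = "unknown status"
    # Codes from 200 to 299 mean the request succeeded
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {phrase} ({outcome})"


for code in (200, 403, 404):
    print(describe_status(code))
```

In the notebook this could be called as `describe_status(result.status_code)`. It is worth checking the code before parsing, because a failed request can still return HTML (an error page) that Beautiful Soup will parse without complaint.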


Run the cell below and see the output it produces. We'll break down what it does afterwards.

In [3]:
titleColumnContainers = soup.find_all('td', {'class': 'titleColumn'})
ratingColumnContainers = soup.find_all('td', {'class': 'ratingColumn imdbRating'})

# DataFrame.append was removed in pandas 2.0, so we collect the rows in
# plain lists first and build the data frame in one step afterwards.
movieRows = []
for container in titleColumnContainers:
    date = container.span.text   # text of the first <span>: the release year
    title = container.a.text     # text of the first <a>: the movie title
    movieRows.append({'title': title, 'date': date})

ratings = []
for container in ratingColumnContainers:
    rating = container.strong.text  # text of the first <strong>: the rating
    ratings.append(rating)

movieTable = pd.DataFrame(movieRows, columns=['title', 'date'])
movieTable['rating'] = pd.Series(ratings, dtype='object')  # matched to the rows by position
movieTable

|  | title | date | rating |
|---:|:---|:---|:---|
| 0 | The Shawshank Redemption | (1994) | 9.2 |
| 1 | The Godfather | (1972) | 9.1 |
| 2 | The Godfather: Part II | (1974) | 9.0 |
| 3 | The Dark Knight | (2008) | 9.0 |
| 4 | 12 Angry Men | (1957) | 8.9 |
| ... | ... | ... | ... |
| 245 | The Battle of Algiers | (1966) | 8.0 |
| 246 | Nights of Cabiria | (1957) | 8.0 |
| 247 | Andrei Rublev | (1966) | 8.0 |
| 248 | Miracle in Cell No. 7 | (2019) | 8.0 |


Let's take a closer look at the first line:
<code> titleColumnContainers = soup.find_all('td',{'class':'titleColumn'}) </code>

The 'td' argument in the find_all() function looks for every 'td' tag in the HTML code. Let's try that by itself and see how many 'td' tags there are:

In [4]:
len(soup.find_all('td'))

1250

The second argument <code> {'class':'titleColumn'} </code> gives more detail about what we want. It tells the function to return only the 'td' tags whose 'class' attribute is 'titleColumn'.
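
The effect of the second argument is easier to see on a small example. The sketch below runs find_all() on a made-up two-row table that imitates the class names used above; `demo_html` and the movie names in it are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up table that imitates the class names used above.
demo_html = """
<table>
  <tr>
    <td class="posterColumn">poster</td>
    <td class="titleColumn"><a href="/title/1/">First Movie</a> <span>(1999)</span></td>
    <td class="ratingColumn imdbRating"><strong>9.0</strong></td>
  </tr>
  <tr>
    <td class="posterColumn">poster</td>
    <td class="titleColumn"><a href="/title/2/">Second Movie</a> <span>(2005)</span></td>
    <td class="ratingColumn imdbRating"><strong>8.5</strong></td>
  </tr>
</table>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Without a filter, every <td> is returned
all_cells = demo_soup.find_all("td")

# With a class filter, only the matching <td> tags are returned
title_cells = demo_soup.find_all("td", {"class": "titleColumn"})

# A value containing a space must match the whole class attribute exactly
rating_cells = demo_soup.find_all("td", {"class": "ratingColumn imdbRating"})

# A single class name also matches tags that carry several classes
also_rating_cells = demo_soup.find_all("td", {"class": "imdbRating"})

print(len(all_cells), len(title_cells), len(rating_cells), len(also_rating_cells))
```

Searching for one class name is usually the more forgiving choice, since the exact-string form stops matching if the site reorders the classes.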

This choice was made only after first analyzing the HTML in the browser. Since every title and year was contained in a 'td' tag with the same 'class' attribute, it was the most obvious choice.

<img src="html_image_1.png">

The find_all() function returns a 'ResultSet' object, which behaves like a list of tags. Each item in it holds what was in the respective 'td' tag; one of these 'containers' is highlighted in the HTML above.

In the for loops, we repeat the same steps for each of these 'td' tags, or 'containers' as the code names them.

Beautiful Soup provides very easy syntax to access elements of 'soup' objects like container:

<code>container.a.text</code> Looks for the first 'a' tag that it finds in 'container', and returns the text within that tag.

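<code>container.span.text</code> Looks for the first 'span' tag that it finds in the 'container', and returns the text within that tag. 'span' is a generic inline container for a piece of text, used here to wrap the release year.

<code>container.strong.text</code> Looks for the first 'strong' tag that it finds in 'container', and returns the text within that tag. 'strong' marks text as important, which browsers usually display in bold.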
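
The dot syntax described above can be tried on a single row. This is a minimal sketch on made-up HTML (`row_html` and its contents are assumptions), showing what each expression returns and what happens when a tag is missing.

```python
from bs4 import BeautifulSoup

# One made-up table row in the same shape as the containers above.
row_html = """
<table><tr>
  <td class="titleColumn">1. <a href="/title/1/">First Movie</a> <span>(1999)</span></td>
  <td class="ratingColumn imdbRating"><strong>9.0</strong></td>
</tr></table>
"""
demo_soup = BeautifulSoup(row_html, "html.parser")
container = demo_soup.find("td", {"class": "titleColumn"})
rating_cell = demo_soup.find("td", {"class": "ratingColumn imdbRating"})

title = container.a.text          # text inside the first <a>
date = container.span.text        # text inside the first <span>
rating = rating_cell.strong.text  # text inside the first <strong>
year = date.strip("()")           # drop the surrounding parentheses
print(title, date, year, rating)

# Dot access gives None when the tag is not there,
# so calling .text on the result would raise an AttributeError.
print(container.strong)
```

Because a missing tag yields `None`, a page whose layout has changed typically fails with `AttributeError: 'NoneType' object has no attribute 'text'`, which is a useful clue when a scraper stops working.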

The rest of the cell is plain 'pandas': for each movie, the values stored in the variables title and date become one row of the data frame movieTable, and the ratings are then added as a third column.
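
Everything scraped this way is text, including the year and the rating. The sketch below shows one way to build a data frame from collected values and convert the columns to numbers; the two movies in `scraped` are made up, and `demo_table` is an illustrative name rather than the notebook's `movieTable`.

```python
import pandas as pd

# Made-up values standing in for what the loops would scrape.
scraped = [
    ("First Movie", "(1999)", "9.0"),
    ("Second Movie", "(2005)", "8.5"),
]

# Collect one dictionary per movie, then build the data frame in one step
rows = []
for title, date, rating in scraped:
    rows.append({"title": title, "date": date, "rating": rating})

demo_table = pd.DataFrame(rows, columns=["title", "date", "rating"])

# Scraped values are strings; convert them before doing arithmetic
demo_table["year"] = demo_table["date"].str.strip("()").astype(int)
demo_table["rating"] = demo_table["rating"].astype(float)

print(demo_table)
print(demo_table["rating"].mean())
```

Building the frame from a list of dictionaries also avoids growing a data frame one row at a time, which is slow for long lists.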

# Try it yourself!

See if you can use the previous code to add a column with the links to the pictures for each top movie.

Hint on approach: attrs['src'] will pull the link to the image if given the correct tag.

In [None]:
#Your code here

## Extra
Using the IPython library, we can use the column of links from above to display a movie poster. Try using pandas to insert your link instead of copying and pasting.

In [None]:
import IPython.display as Disp

#Insert movie link before the first comma
Disp.Image(     , width=150, height=200)