## Basic components of a website

### HTML
HTML stands for Hypertext Markup Language, and every website on the internet uses it to display information. Even the Jupyter notebook system uses it to display this notebook in your browser. If you right-click on a website and select "View Page Source", you can see the raw HTML of the page. This is the text that Python will read to grab information. Let's take a look at a simple webpage's HTML:

    <!DOCTYPE html>  
    <html>  
        <head>
            <title>Title on Browser Tab</title>
        </head>
        <body>
            <h1> Website Header </h1>
            <p> Some Paragraph </p>
        </body>
    </html>
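
To make "the information that Python will be looking at" concrete, here is a minimal sketch that walks the sample page above using only the standard library's `html.parser` module (not the BeautifulSoup library used later in this notebook). The names `SAMPLE_HTML` and `TagCollector` are made up for this illustration.

```python
from html.parser import HTMLParser

# The sample page from above, stored as a plain Python string.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>Title on Browser Tab</title>
    </head>
    <body>
        <h1> Website Header </h1>
        <p> Some Paragraph </p>
    </body>
</html>"""


class TagCollector(HTMLParser):
    """Record every opening tag and the text found directly inside it."""

    def __init__(self):
        super().__init__()
        self.tags = []        # opening tags, in document order
        self.text = {}        # tag name -> stripped text content
        self._current = None  # tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        self._current = tag

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        # Ignore the whitespace between tags; keep real text only.
        if self._current and data.strip():
            self.text[self._current] = data.strip()


collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.tags)
print(collector.text)
```

Running this lists the tags in the order they open and maps `title`, `h1`, and `p` to their text, which is the same structure a scraping library navigates for you.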

https://www.example.com/

https://www.w3schools.com/html/default.asp

### CSS

CSS stands for Cascading Style Sheets. It is what gives "style" to a website, including colors and fonts, and even some animations! CSS uses attributes such as **id** or **class** to connect an HTML element to a CSS rule, such as a particular color. An **id** must be unique within the HTML document, so it is basically a single-use connection. A **class** defines a general style that can then be linked to multiple HTML tags. Basically, if you only want a single HTML tag to be red, you would use an id; if you wanted several HTML tags/blocks to be red, you would create a class in your CSS doc and then link it to each of those blocks.

## Web Scraping with Python


    


### Grabbing the title of a page

Let's start very simple: we will grab the title of a page. Remember that this is the HTML block with the **title** tag. For this task we will use **www.example.com**, which is a website specifically made to serve as an example domain. Let's go through the main steps:

In [None]:
import requests

In [None]:
# Step 1: Use the requests library to grab the page
# Note, this may fail if you have a firewall blocking Python/Jupyter
# Note sometimes you need to run this twice if it fails the first time
res = requests.get("http://www.example.com")

The **requests library** in Python is a powerful and easy-to-use HTTP library for making web requests. It allows you to send HTTP/1.1 requests (GET, POST, PUT, DELETE, etc.) and handle responses from web servers.

The **res** variable now holds a requests.models.Response object, and it contains the information returned by the website, for example:

In [None]:
type(res)

In [None]:
res.text
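
Since the request can fail (see the firewall note above), it may help to check the response before parsing it. The sketch below wraps the call in a hypothetical helper named `fetch_html`; the helper name and the 10 second timeout are assumptions for illustration, while `timeout=` and `raise_for_status()` are standard parts of the requests API.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, raising an error for 4xx/5xx responses."""
    # timeout stops the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for a failed status code
    response.raise_for_status()
    return response.text
```

A call such as `fetch_html("http://www.example.com")` would then return the same string as **res.text**, or raise an exception instead of silently handing an error page to the parser.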

____
Now we use BeautifulSoup to analyze the extracted page. Technically we could use our own custom script to look for items in the string of **res.text**, but the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file). Using BeautifulSoup we can create a "soup" object that contains all the "ingredients" of the webpage. Don't ask me about the weird library names, I didn't choose them! :)

In [None]:
import bs4

In [None]:
soup = bs4.BeautifulSoup(res.text,"lxml")

In [None]:
soup

Now let's use the **.select()** method to grab elements. We are looking for the 'title' tag, so we will pass in 'title'.


In [None]:
soup.select('title')

Notice what is returned here: it's actually a list containing all the title elements (along with their tags). You can use indexing or even looping to grab the elements from the list. Since each item in the list is still a specialized Tag object, we can use method calls to grab just the text.

In [None]:
title_tag = soup.select('title')

In [None]:
title_tag[0]

In [None]:
type(title_tag[0])

In [None]:
title_tag[0].getText()
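
When only the first match is needed, BeautifulSoup also offers `.select_one()`, which avoids the `[0]` indexing. The sketch below runs on a small hard-coded HTML string (an assumption, so that it works without a network connection) and uses the name `demo_soup` so it does not overwrite the **soup** object above.

```python
import bs4

# Stand-in for a downloaded page, so the example runs offline.
demo_html = "<html><head><title> Example Domain </title></head><body></body></html>"
demo_soup = bs4.BeautifulSoup(demo_html, "html.parser")

matches = demo_soup.select("title")    # list-like result, even for a single match
first = demo_soup.select_one("title")  # the first match, or None if nothing matches

print(len(matches))
print(first.get_text(strip=True))      # strip=True trims surrounding whitespace
print(demo_soup.select_one("h1"))      # there is no <h1> here, so this is None
```

Checking for `None` before calling `.get_text()` is a simple way to avoid errors on pages that lack the element.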

### Grabbing all elements of a class

Let's try to grab all the section headings of the Wikipedia Article on Grace Hopper from this URL: https://en.wikipedia.org/wiki/Grace_Hopper

In [None]:
# First get the request
res = requests.get('https://en.wikipedia.org/wiki/Grace_Hopper')

In [None]:
res.text

In [None]:
# Create a soup from request
soup = bs4.BeautifulSoup(res.text,"lxml")

Now it's time to figure out what we are actually looking for. Inspect the element on the page to see that the section headers have the class "mw-headline". Because this is a class and not a straight tag, we need to adhere to some CSS selector syntax. The table below summarizes what can be passed to the .select() method:

<table>

<thead >
<tr>
<th>
<p>Syntax to pass to the .select() method</p>
</th>
<th>
<p>Match Results</p>
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<p><code>soup.select('div')</code></p>
</td>
<td>
<p>All elements with the <code>&lt;div&gt;</code> tag</p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('#some_id')</code></p>
</td>
<td>
<p>The HTML element containing the <code>id</code> attribute of <code>some_id</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('.notice')</code></p>
</td>
<td>
<p>All the HTML elements with the CSS <code>class</code> named <code>notice</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div span')</code></p>
</td>
<td>
<p>Any elements named <code>&lt;span&gt;</code> that are within an element named <code>&lt;div&gt;</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div &gt; span')</code></p>
</td>
<td>
<p>Any elements named <code class="literal2">&lt;span&gt;</code> that are <span><em >directly</em></span> within an element named <code class="literal2">&lt;div&gt;</code>, with no other element in between</p>
</td>
</tr>
</tbody>
</table>
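
The selector patterns in the table can be tried on a tiny document. This is a sketch on made-up HTML (the ids, classes, and text are invented for illustration), counting how many elements each pattern matches.

```python
import bs4

# Made-up markup: one <div> with a direct <span> child and a nested <span>.
demo_html = """
<div id="some_id">
  <span>direct child</span>
  <p class="notice"><span>nested deeper</span></p>
</div>
<p class="notice">outside the div</p>
"""
demo_soup = bs4.BeautifulSoup(demo_html, "html.parser")

print(len(demo_soup.select("div")))         # elements named <div>
print(len(demo_soup.select("#some_id")))    # the element with id="some_id"
print(len(demo_soup.select(".notice")))     # elements with class "notice"
print(len(demo_soup.select("div span")))    # <span> anywhere inside a <div>
print(len(demo_soup.select("div > span")))  # <span> directly inside a <div>
```

The last two lines differ because the nested `<span>` sits inside a `<p>`, so it is a descendant of the `<div>` but not a direct child.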

In [None]:
soup.select("p")

In [None]:
# note depending on your IP Address,
# this class may be called something different
soup.select(".reference")
# soup.select(".cite-bracket")

In [None]:
for item in soup.select(".reference"):
    print(item.text)

### Getting an Image from a Website


You can make dictionary-like calls for parts of the Tag. In this case, we are interested in the **src**, or "source", of the image, which should be its own .jpg or .png link.
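
As a rough sketch of those dictionary-like calls, the example below parses a single hypothetical `<img>` tag. The markup and its `src` value are invented for illustration and are not taken from the real page.

```python
import bs4

# Hypothetical <img> tag; the src value is made up for illustration.
demo_html = '<img class="thumbimage" src="//upload.example.org/pics/220px-Demo.jpg" alt="Demo">'
img_tag = bs4.BeautifulSoup(demo_html, "html.parser").select_one("img")

print(img_tag["src"])        # dictionary-style lookup; raises KeyError if absent
print(img_tag.get("width"))  # .get() returns None when the attribute is missing
print(img_tag["class"])      # class can hold several values, so it is a list
print(img_tag.attrs)         # all attributes of the tag as a dictionary
```

Using `.get()` is the safer choice when looping over many tags, since some of them may not carry the attribute at all.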

We can actually display it with a markdown cell with the following:

    <img src='https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg'>

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg'>

Now that you have the actual src link, you can grab the image with requests.get and then read its .content attribute. Note how we had to add https:// before the link; if you don't do this, requests will complain (but it gives you a pretty descriptive error message).
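
Instead of adding the scheme by hand, the standard library's `urljoin` can complete a link relative to the page it came from. In this sketch the page URL is an assumption (the notebook does not say which article the image was taken from), and the `src` value is the image link above with its scheme removed.

```python
from urllib.parse import urljoin

# Assumed page address; only its scheme and host matter for this example.
page_url = "https://en.wikipedia.org/wiki/Some_Article"

# A scheme-less src, as it might appear in the page's HTML.
src = "//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg"

full_link = urljoin(page_url, src)              # borrows "https:" from page_url
site_link = urljoin(page_url, "/static/a.png")  # site-relative path gets the host too
same_link = urljoin(page_url, "https://example.com/b.png")  # absolute links are unchanged

print(full_link)
print(site_link)
print(same_link)
```

The resulting `full_link` can be passed straight to `requests.get`.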

https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg

In [None]:
image_link = requests.get('https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg')

In [None]:
# The raw content (it's a binary file, meaning we will need to use binary read/write methods for saving it)
image_link.content
# It's a long stream of bytes that represents the image (a JPEG like this one starts with b'\xff\xd8\xff...'), which is not readable as plain text.

**Let's write this to a file. Note the 'wb' mode, which denotes a binary write of the file.**

In [None]:
f = open('my_new_file_name.jpg','wb')

In [None]:
f.write(image_link.content)

In [None]:
f.close()
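
The open/write/close steps can also be written with a `with` block, which closes the file automatically even if the write fails. This sketch uses a few stand-in bytes and a temporary folder (both assumptions, so it runs without downloading anything); in the notebook you would write `image_link.content` instead.

```python
import os
import tempfile

# Stand-in bytes, not a real image; in the notebook this would be image_link.content.
fake_image_bytes = b"\xff\xd8\xff\xe0 stand-in bytes"

# Write into a temporary folder so the example leaves the working directory alone.
demo_path = os.path.join(tempfile.mkdtemp(), "my_new_file_name.jpg")

with open(demo_path, "wb") as image_file:  # 'wb' = write in binary mode
    written = image_file.write(fake_image_bytes)

with open(demo_path, "rb") as image_file:  # 'rb' = read in binary mode
    round_trip = image_file.read()

print(written == len(fake_image_bytes), round_trip == fake_image_bytes)
```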

Now we can display this file right here in the notebook as markdown using:

    <img src='my_new_file_name.jpg'>
    
Just write the above line in a new markdown cell and it will display the image we just downloaded!

## Scraping images

https://www.google.com/search?q=flowers&sxsrf=ALeKk00uvzQYZFJo03cukIcMS-pcmmbuRQ:1589501547816&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjEm4LZyrTpAhWjhHIEHewPD1MQ_AUoAXoECBAQAw&biw=1440&bih=740

In [None]:
# import re # regular expressions, provides tools for matching, searching, and manipulating strings based on patterns
import requests
import os
from bs4 import BeautifulSoup


In [None]:
f = open("images_flowers.txt", "w")
res=[]
def download_google(url):
    #url = 'https://www.google.com/search?q=flowers&sxsrf=ALeKk00uvzQYZFJo03cukIcMS-pcmmbuRQ:1589501547816&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjEm4LZyrTpAhWjhHIEHewPD1MQ_AUoAXoECBAQAw&biw=1440&bih=740'
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')

    for raw_img in soup.find_all('img'):
        link = raw_img.get('src')
        # Skip <img> tags without a src so that no None values are stored
        if link:
            res.append(link)
            f.write(link + "\n")


download_google('https://www.google.com/search?q=flowers&sxsrf=ALeKk00uvzQYZFJo03cukIcMS-pcmmbuRQ:1589501547816&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjEm4LZyrTpAhWjhHIEHewPD1MQ_AUoAXoECBAQAw&biw=1440&bih=740')

f.close()


### Save the images from the URLs to a folder

In [None]:
# Change output_folder if you want to save the images somewhere else
output_folder = "/content/flowers"
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Open a file to store image links
f = open("images_flowers.txt", "w")
res = []

def download_google(url):
    # Fetch the webpage content
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')

    # Loop through all image elements
    for index, raw_img in enumerate(soup.find_all('img')):
        link = raw_img.get('src')
        if link:
            res.append(link)
            f.write(link + "\n")
            try:
                # Fetch image content
                img_data = requests.get(link).content
                # Save image as .jpg in the specified folder
                file_path = os.path.join(output_folder, f"image_{index + 1}.jpg")
                with open(file_path, 'wb') as img_file:
                    img_file.write(img_data)
                print(f"Saved: {file_path}")
            except Exception as e:
                print(f"Failed to download image {index + 1}: {e}")

# Call the function with the Google search URL
download_google('https://www.google.com/search?q=flowers&sxsrf=ALeKk00uvzQYZFJo03cukIcMS-pcmmbuRQ:1589501547816&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjEm4LZyrTpAhWjhHIEHewPD1MQ_AUoAXoECBAQAw&biw=1440&bih=740')

# Close the file
f.close()
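
Some `src` values on a search results page may not be downloadable links at all, for example inline `data:` placeholders or site-relative paths, and these end up in the "Failed to download" branch above. A minimal sketch of a pre-check, using a hypothetical helper named `is_downloadable` and made-up sample links:

```python
from urllib.parse import urlparse


def is_downloadable(link):
    """Return True only for absolute http(s) links that requests can fetch."""
    if not link:
        return False
    parsed = urlparse(link)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# Made-up sample values covering the common cases.
candidates = [
    "https://example.com/a.jpg",                   # absolute link: usable
    "data:image/gif;base64,R0lGODlhAQABAAAAACw=",  # inline data, not a URL to fetch
    "/images/logo.png",                            # relative path without a host
    None,                                          # <img> tag with no src
]
usable = [link for link in candidates if is_downloadable(link)]
print(usable)
```

Filtering with such a check before calling `requests.get(link)` would keep the numbered output files free of gaps caused by links that can never succeed.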


# Scraping Data from a Real Website + Pandas


https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue'

In [None]:
page = requests.get(url)

In [None]:
page.text

In [None]:
type(page.text)

BeautifulSoup is a Python library used for web scraping and parsing HTML or XML documents. It creates a parse tree that makes it easy to extract and manipulate data from web pages. It works with parsers like html.parser, lxml, or html5lib.

In [None]:
soup = BeautifulSoup(page.text, 'html.parser')

In [None]:
print(soup)

In [None]:
soup.find('table')
# The code soup.find('table') searches for the first <table> element in the HTML content parsed by BeautifulSoup.

In [None]:
len(soup.find_all('table'))

In [None]:
type(soup.find_all('table'))

In [None]:
soup.find_all('table')[1]

In [None]:
soup.find('table', class_ = 'wikitable sortable')
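
One detail worth knowing about `class_`: as far as the BeautifulSoup documentation describes it, a value containing a space is compared against the whole class attribute exactly as written, so the order of the class names matters. The sketch below illustrates this on two made-up tables and shows a CSS selector as an order-independent alternative.

```python
import bs4

# Two made-up tables with the same classes in a different order.
demo_html = """
<table class="wikitable sortable"><tr><th>A</th></tr></table>
<table class="sortable wikitable"><tr><th>B</th></tr></table>
"""
demo_soup = bs4.BeautifulSoup(demo_html, "html.parser")

by_one_class = demo_soup.find_all("table", class_="wikitable")
by_exact_string = demo_soup.find_all("table", class_="wikitable sortable")
by_selector = demo_soup.select("table.wikitable.sortable")

print(len(by_one_class))     # matches any table that has the class "wikitable"
print(len(by_exact_string))  # matches only the exact attribute string
print(len(by_selector))      # matches tables having both classes, in any order
```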

In [None]:
table = soup.find_all('table')[1]

In [None]:
print(table)

In [None]:
world_titles = table.find_all('th')

In [None]:
world_titles

In [None]:
world_table_titles = [title.text.strip() for title in world_titles]

print(world_table_titles)

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(columns = world_table_titles)

df

In [None]:
column_data = table.find_all('tr')

In [None]:
for row in column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]

    length = len(df)
    df.loc[length] = individual_row_data
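
The loop above assumes every row has exactly as many cells as there are column titles; a row that does not fit can make the `df.loc[length] = ...` assignment fail. A possible variation, sketched on a small made-up table, collects the rows in a list first, skips rows of the wrong length, and builds the DataFrame in one step. The names `demo_table`, `headers`, `rows`, and `demo_df` are chosen for this example only.

```python
import bs4
import pandas as pd

# Made-up table; the last row is deliberately missing a cell.
demo_html = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Alpha</td></tr>
  <tr><td>2</td><td>Beta</td></tr>
  <tr><td>3</td></tr>
</table>
"""
demo_table = bs4.BeautifulSoup(demo_html, "html.parser").find("table")

headers = [th.text.strip() for th in demo_table.find_all("th")]

rows = []
for tr in demo_table.find_all("tr")[1:]:  # [1:] skips the header row
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) == len(headers):        # skip rows that do not fit the columns
        rows.append(cells)

demo_df = pd.DataFrame(rows, columns=headers)
print(demo_df)
```

Note that every value is still a string at this point, just as in the loop above; numeric columns would need converting before doing arithmetic on them.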

In [None]:
df

Unnamed: 0,Rank,Name,Industry,Revenue (USD billions),Employees,Headquarters
0,1,Cargill,Food industry,177.0,160000,"Minnetonka, Minnesota"
1,2,Koch Industries,Conglomerate,125.0,120000,"Wichita, Kansas"
2,3,Publix Super Markets,Retail,54.5,250000,"Lakeland, Florida"
3,4,"Mars, Incorporated",Food industry,47.0,140000,"McLean, Virginia"
4,5,H-E-B,Retail,43.6,145000,"San Antonio, Texas"
5,6,Reyes Holdings,Wholesaling,40.0,36000,"Rosemont, Illinois"
6,7,Enterprise Holdings,Car rental,35.0,90000,"Clayton, Missouri"
7,8,C&S Wholesale Grocers,Wholesaling,34.7,15000,"Keene, New Hampshire"
8,9,Love's,Petroleum industry and Retail,26.5,40000,"Oklahoma City, Oklahoma"
9,10,Southern Glazer's Wine and Spirits,Food industry,26.0,24000,"Miramar, Florida"


In [None]:
df.to_csv('Companies.csv', index = False)