# Web Scraping 

**Inspecting web pages with HTML**

If we open any web page in the browser (for example Google Chrome), right-click and select 'Inspect', the developer tools open and show the HTML that the page is built from. Clicking the arrow-in-a-box icon on the left side of that toolbar lets us select an individual element on the page and jump to its HTML.

**Beautiful Soup and Requests**

In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
url = 'https://www.scrapethissite.com/pages/forms/'
# requests.get(url)
page = requests.get(url)

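If we get `<Response [200]>`, the request succeeded and the page content was returned. Other common status codes: 204 means the request succeeded but the response has no content, 400 means a bad request, 401 means the request is unauthorized, and 404 means the requested page could not be found.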
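
To look up what a status code means, a small sketch using Python's standard `http.HTTPStatus` enum may help. The helper name `describe_status` is made up for this example. With `requests`, the numeric code of a response is available as `page.status_code`, so it could be passed to a helper like this one.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code together with its standard reason phrase."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # The code is not one of the statuses known to the standard library
        return f"{code} (unknown status)"


# Print the meaning of the codes mentioned above
for code in (200, 204, 400, 401, 404):
    print(describe_status(code))
```

In a scraper, it is usually enough to check that the code is 200 before parsing the page.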

In [None]:
# BeautifulSoup(page.text, 'html') parses the downloaded HTML text into a searchable object
soup = BeautifulSoup(page.text, 'html')
print(soup)

In [None]:
print(soup.prettify()) # Print the HTML with indentation so it is easier to read

**Find and Find_All**

In [None]:
from bs4 import BeautifulSoup
import requests
url = 'https://www.scrapethissite.com/pages/forms/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html')
print(soup.prettify())

In [None]:
soup.find('div') # Returns only the first div tag on the page

In [None]:
soup.find_all('div')

In [None]:
soup.find_all('div', class_ = "col-md-12") # class_ narrows the search to tags with that CSS class

In [None]:
soup.find_all('p')

In [None]:
soup.find_all('p', class_ = "lead")

In [None]:
soup.find('p', class_ = "lead").text # .text works on a single tag; find_all returns a list, so .text must be called on each item

In [None]:
soup.find('p', class_ = "lead").text.strip()
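
To get the text of every match from `find_all`, one option is to loop over the results and read the text of each tag. The sketch below uses a small made-up HTML string instead of the real page, so the tags and text in it are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment standing in for a downloaded page
html = """
<div>
  <p class="lead"> First paragraph </p>
  <p class="lead"> Second paragraph </p>
  <p>Not a lead paragraph</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet, so read the text of each tag in turn
lead_texts = [p.get_text(strip=True) for p in soup.find_all("p", class_="lead")]
print(lead_texts)
```

Here `get_text(strip=True)` does the same job as `.text.strip()` for these simple tags.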

In [None]:
soup.find_all('th')

In [None]:
soup.find('th').text.strip()

# Portfolio - Web Scraping

In [None]:
from bs4 import BeautifulSoup
import requests
url = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html')
print(soup)

In [None]:
# soup.find('table', class_ ="wikitable sortable")
table = soup.find_all('table')[1]
print(table)

In [None]:
# soup.find_all('th')
table_titles = table.find_all('th') # Search inside the table instead of the whole soup, so only this table's headers are returned
table_titles

In [None]:
loop_titles = [titles.text.strip() for titles in table_titles]
print(loop_titles)

In [None]:
column_data = table.find_all('tr')
print(column_data)

In [None]:
import pandas as pd
df1 = pd.DataFrame(columns = loop_titles)
df1

In [None]:
for row in column_data[1:]:  # Skip the first row: it is the header row, which has no td cells, so it would give an empty list and cause an error
    row_data = row.find_all('td')
    table_rows = [rows.text.strip() for rows in row_data]

    length = len(df1)
    df1.loc[length] = table_rows
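
# An alternative to appending one row at a time with df1.loc is to collect all
# rows in a list and build the DataFrame once at the end. The block below is
# only a sketch of that idea on a small made-up table, so the HTML, the company
# names and the variable names are assumptions and not the Wikipedia data.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table standing in for the scraped one
html = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Example Corp</td></tr>
  <tr><td>2</td><td>Sample Inc</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Keep only rows whose cell count matches the headers (this drops the header row)
    if len(cells) == len(headers):
        rows.append(cells)

# Build the DataFrame in one step from the collected rows
df = pd.DataFrame(rows, columns=headers)
print(df)
```

# Checking the number of cells also protects against rows that are shorter
# than the header, which would otherwise raise an error when assigned to the DataFrame.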

In [None]:
df1

In [None]:
df1.to_csv(r'Downloads\Companies.csv', index = False) # Save as a CSV file; index = False leaves out the row numbers