## Web Scraping
Web scraping is the process of extracting data from websites by automating the retrieval and parsing of HTML (Hypertext Markup Language). It involves requesting web pages, retrieving their HTML content, and then extracting specific data points from that HTML.

Web scraping allows you to gather data from websites at scale, rather than manually copying and pasting information from individual web pages. By automating the process, you can extract data from multiple pages or even entire websites in a structured and efficient manner.

## Beautiful Soup
Beautiful Soup is a Python library for parsing HTML and XML documents. It builds a parse tree on top of a parser such as Python's built-in html.parser, and you can then search, navigate, and modify that tree. It is versatile and saves a lot of time. In this article we will learn how to scrape data using Beautiful Soup.
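
Before turning to live websites, the basic calls can be tried on an inline HTML string. The snippet below is a minimal sketch: the HTML and the variable names are made up for illustration, and only `find`, `find_all`, `.text`, and `.contents` from Beautiful Soup are used.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html = """
<html><body>
  <h1 id="title">Top Movies</h1>
  <ul>
    <li class="movie">The Godfather <span class="year">(1972)</span></li>
    <li class="movie">12 Angry Men <span class="year">(1957)</span></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag
heading = soup.find("h1", id="title").text

# find_all() returns every matching tag
years = [span.text.strip("()") for span in soup.find_all("span", class_="year")]

# The first child of each <li> is the text node that holds the title
titles = [li.contents[0].strip() for li in soup.find_all("li", class_="movie")]

print(heading, titles, years)
```

The same pattern (fetch the page, build the soup, then search by tag, class, or id) is what the examples in this article follow.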

## Example 1

In [4]:
# Import the required libraries
import pandas as pd            # to build the DataFrame
import requests                # to send the HTTP request to the URL
from bs4 import BeautifulSoup  # to parse the returned HTML
import numpy as np             # not used in the cells shown here

# URL of the IMDb search page that lists the top-rated titles
url = 'https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating'
# requests.get sends an HTTP GET request and returns the response
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Empty lists that will collect one value per movie
movie_name = []
year = []
time = []
rating = []
metascore = []
votes = []
gross = []

In [5]:
# Collect the container <div> of every movie on the page
movie_data = soup.find_all('div', attrs={'class': 'lister-item mode-advanced'})

# Loop over the containers and extract each field
for store in movie_data:
    name = store.h3.a.text
    movie_name.append(name)

    year_of_release = store.h3.find('span', class_ = 'lister-item-year text-muted unbold').text.replace('(', '').replace(')', '')
    year.append(year_of_release)

    runtime = store.p.find('span', class_ = 'runtime').text.replace(' min', '')
    time.append(runtime)

    rate = store.find('div', class_ = 'inline-block ratings-imdb-rating').text.replace('\n', '')
    rating.append(rate)

    meta  = store.find('span', class_ = 'metascore').text.replace(' ', '') if store.find('span', class_ = 'metascore') else '^^^^^^'
    metascore.append(meta)
    # Votes and gross share the same attribute (name="nv"), so fetch both spans and pick each by index
    value = store.find_all('span', attrs = {'name': 'nv'})

    vote = value[0].text
    votes.append(vote)

    # Not every movie lists a gross figure; fall back to a placeholder
    grosses = value[1].text if len(value) > 1 else '*****'
    gross.append(grosses)

In [6]:
#creating a dataframe using pandas library
movie_DF = pd.DataFrame({'Name of movie': movie_name, 'Year of release': year, 'Watchtime': time, 'Movie Rating': rating, 'Metascore': metascore, 'Votes': votes, 'Gross collection': gross})
movie_DF


# Save the data to an Excel file (writing .xlsx requires the openpyxl package)
movie_DF.to_excel("Top_100_IMDB_Movies.xlsx")

In [7]:
movie_DF.head()

Unnamed: 0,Name of movie,Year of release,Watchtime,Movie Rating,Metascore,Votes,Gross collection
0,The Shawshank Redemption,1994,142,9.3,82,2757982,$28.34M
1,The Godfather,1972,175,9.2,100,1919137,$134.97M
2,The Dark Knight,2008,152,9.0,84,2730947,$534.86M
3,Schindler's List,1993,195,9.0,95,1389930,$96.90M
4,12 Angry Men,1957,96,9.0,97,817118,$4.36M
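
The gross column above holds strings such as '$28.34M', and movies without a figure get the '*****' placeholder, so the values cannot be summed or sorted numerically as they stand. A minimal sketch of one way to convert them follows; the helper name `parse_gross` is hypothetical, and it assumes every real value has the form '$<number>M'.

```python
def parse_gross(raw):
    """Convert a label such as '$28.34M' to millions of dollars.

    Returns None for anything that does not start with '$',
    including the '*****' placeholder used in the scraping loop.
    """
    if not raw.startswith("$"):
        return None
    # Strip the leading '$' and the trailing 'M' before converting
    return float(raw.strip("$M"))


sample = ["$28.34M", "$134.97M", "*****"]
gross_millions = [parse_gross(value) for value in sample]
print(gross_millions)
```

With pandas, the helper could be applied to the whole column with `movie_DF["Gross collection"].map(parse_gross)`.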


In [8]:
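# The scraped year values are strings, so max() compares them lexicographically;
# a label such as 'II 2018' therefore sorts above every plain four-digit year.
latest = movie_DF["Year of release"].max()

# Show every row whose year label equals that maximum
movie_DF.loc[movie_DF["Year of release"]==latest,["Name of movie","Year of release","Watchtime","Movie Rating","Metascore","Votes","Gross collection"]]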

Unnamed: 0,Name of movie,Year of release,Watchtime,Movie Rating,Metascore,Votes,Gross collection
58,96,II 2018,158,8.5,^^^^^^,33352,*****
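
The row above is selected because 'II 2018' is the largest year label when compared as text, not because 2018 is the most recent year. One possible fix, sketched below, is to pull the four-digit year out of each label before comparing. The helper name `parse_year` is hypothetical, and the label '2019' is an invented value added only to show the difference.

```python
import re


def parse_year(raw):
    """Return the four-digit year found in a scraped label, or None."""
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None


# '1994' and 'II 2018' come from the output above; '2019' is hypothetical
labels = ["1994", "II 2018", "2019"]

text_max = max(labels)  # compares character by character
numeric_max = max(parse_year(label) for label in labels)

print(text_max)     # II 2018
print(numeric_max)  # 2019
```

Applying the helper to the year column before calling `max()` would make the "latest" lookup compare numbers instead of labels.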


## Example 2

In [9]:
import requests
#the website URL
url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"
result = requests.get(url_link).text
#print(result)

In [23]:
from bs4 import BeautifulSoup
#import requests library
import requests
#the website URL
url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"
result = requests.get(url_link).text
doc = BeautifulSoup(result, "html.parser")

#print(doc.prettify())

In [24]:
res = doc.find(id = "content")
#print(res)

#### Find elements by class name

In [25]:
heading = res.find(class_ = "firstHeading")
print(heading)

<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">List of states and territories of the United States</span></h1>


#### Extract text from HTML elements

In [26]:
print(heading.text)

List of states and territories of the United States


In [27]:
url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"
result = requests.get(url_link).text
doc = BeautifulSoup(result, "html.parser")

In [28]:
# Select the table that lists the states
my_table = doc.find("table", class_="wikitable sortable plainrowheaders")

In [29]:
th_tags = my_table.find_all('th')
names = []
for elem in th_tags:
    a_links = elem.find_all("a")
    # Get the text inside each <a> tag; .string is None when the tag
    # does not wrap a single piece of text
    for i in a_links:
        names.append(i.string)
print(names)

['postal abbreviation', '[8]', '[A]', '[10]', '[11]', '[11]', '[11]', None, '[12]', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', '[B]', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', '[B]', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', '[B]', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', '[B]', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']


In [30]:
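# Skip the nine header entries that precede the first state name
final_list = names[9:]
states = []
print(final_list)  # debug: inspect the remaining entries
for string in final_list:
    print(f"Checking string: '{string}'")  # debug: trace each entry
    # Footnote markers such as '[B]' are three characters long, so keeping
    # only longer strings leaves the state names
    if len(string) > 3:
        states.append(string)
print(states)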

['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', '[B]', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', '[B]', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', '[B]', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', '[B]', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
Checking string: 'Alabama'
Checking string: 'Alaska'
Checking string: 'Arizona'
Checking string: 'Arkansas'
Checking string: 'California'
Checking string: 'Colorado'
Checking string: 'Connecticut'
Checking string: 'Delaware'
Checking string: 'Florida'
Checking string: 'Georgia'
Checking string: 'Hawaii'
Checking string: 'Idaho'
Checking string: 'I
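
The length check works here because every footnote marker is short and every state name is longer than three characters. A slightly more explicit filter, sketched below as an alternative rather than the approach used in this notebook, tests for the bracket and for `None` directly. The sample list is a shortened, hand-picked subset of the `names` output shown earlier.

```python
# Shortened sample of the scraped link texts (see the names output above)
names = ["postal abbreviation", "[8]", None, "[12]",
         "Alabama", "Alaska", "[B]", "Iowa", "Ohio"]

# Keep entries that are real strings, are not footnote markers like '[B]',
# and start with an uppercase letter (which drops the header link text)
states = [
    name for name in names
    if name and not name.startswith("[") and name[0].isupper()
]

print(states)
```

This avoids relying on the position of the first state in the list, although the uppercase test is itself an assumption about how the table's links are written.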

In [31]:
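# The <div> tags in the table hold the population figures,
# mixed with short one- or two-digit values
divs = my_table.find_all("div")
pop = []
for i in divs:
    pop.append(i.string)
print(pop)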

['5,024,279', '7', '733,391', '1', '7,151,502', '9', '3,011,524', '4', '39,538,223', '52', '5,773,714', '8', '3,605,944', '5', '989,948', '1', '21,538,187', '28', '10,711,908', '14', '1,455,271', '2', '1,839,106', '2', '12,812,508', '17', '6,785,528', '9', '3,190,369', '4', '2,937,880', '4', '4,505,836', '6', '4,657,757', '6', '1,362,359', '2', '6,177,224', '8', '7,029,917', '9', '10,077,331', '13', '5,706,494', '8', '2,961,279', '4', '6,154,913', '8', '1,084,225', '2', '1,961,504', '3', '3,104,614', '4', '1,377,529', '2', '9,288,994', '12', '2,117,522', '3', '20,201,249', '26', '10,439,388', '14', '779,094', '1', '11,799,448', '15', '3,959,353', '5', '4,237,256', '6', '13,002,700', '17', '1,097,379', '2', '5,118,425', '7', '886,667', '1', '6,910,840', '9', '29,145,505', '38', '3,271,616', '4', '643,077', '1', '8,631,393', '11', '7,705,281', '10', '1,793,716', '2', '5,893,718', '8', '576,851', '1']


In [32]:
# Keep only the population figures, which are longer than three characters
pop_final = []
for i in pop:
    if len(i) > 3:
        pop_final.append(i)
print(pop_final)

['5,024,279', '733,391', '7,151,502', '3,011,524', '39,538,223', '5,773,714', '3,605,944', '989,948', '21,538,187', '10,711,908', '1,455,271', '1,839,106', '12,812,508', '6,785,528', '3,190,369', '2,937,880', '4,505,836', '4,657,757', '1,362,359', '6,177,224', '7,029,917', '10,077,331', '5,706,494', '2,961,279', '6,154,913', '1,084,225', '1,961,504', '3,104,614', '1,377,529', '9,288,994', '2,117,522', '20,201,249', '10,439,388', '779,094', '11,799,448', '3,959,353', '4,237,256', '13,002,700', '1,097,379', '5,118,425', '886,667', '6,910,840', '29,145,505', '3,271,616', '643,077', '8,631,393', '7,705,281', '1,793,716', '5,893,718', '576,851']
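
The population figures are still strings with thousands separators, so they would sort and add up as text. A minimal sketch of converting them to integers, using the first three values from the output above:

```python
# First three values from the pop_final output above
pop_strings = ["5,024,279", "733,391", "7,151,502"]

# Remove the thousands separators, then convert to int
populations = [int(value.replace(",", "")) for value in pop_strings]

print(populations)
print(max(populations))
```

The same list comprehension could be applied to the full `pop_final` list before it is placed in a DataFrame.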


In [33]:
import pandas as pd

df = pd.DataFrame()

df['state'] = states
df['population'] = pop_final

print(df)

             state  population
0          Alabama   5,024,279
1           Alaska     733,391
2          Arizona   7,151,502
3         Arkansas   3,011,524
4       California  39,538,223
5         Colorado   5,773,714
6      Connecticut   3,605,944
7         Delaware     989,948
8          Florida  21,538,187
9          Georgia  10,711,908
10          Hawaii   1,455,271
11           Idaho   1,839,106
12        Illinois  12,812,508
13         Indiana   6,785,528
14            Iowa   3,190,369
15          Kansas   2,937,880
16        Kentucky   4,505,836
17       Louisiana   4,657,757
18           Maine   1,362,359
19        Maryland   6,177,224
20   Massachusetts   7,029,917
21        Michigan  10,077,331
22       Minnesota   5,706,494
23     Mississippi   2,961,279
24        Missouri   6,154,913
25         Montana   1,084,225
26        Nebraska   1,961,504
27          Nevada   3,104,614
28   New Hampshire   1,377,529
29      New Jersey   9,288,994
30      New Mexico   2,117,522
31      

## Example 3

In [34]:
# pip3 install requests
import requests

# pip3 install beautifulsoup4
from bs4 import BeautifulSoup

# pip3 install pandas
import pandas as pd

books = []

for i in range(1, 5):
    url = f"https://books.toscrape.com/catalogue/page-{i}.html"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    ol = soup.find('ol')
    articles = ol.find_all('article', class_='product_pod')
    # Process the books on this page before moving to the next one
    for article in articles:
        image = article.find('img')
        title = image.attrs['alt']
        # The rating is the second CSS class of the first <p> tag
        starTag = article.find('p')
        star = starTag['class'][1]
        price = article.find('p', class_='price_color').text
        # Drop the leading currency symbol before converting
        price = float(price[1:])
        books.append([title, star, price])

df = pd.DataFrame(books, columns=['Title', 'Star Rating', 'Price'])
df.to_csv('books.csv')
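
The Star Rating column stores the second CSS class of the rating tag, which on this site is expected to be a word such as 'Three' rather than a number. The sketch below maps those words to integers; the names `STAR_WORDS` and `star_to_int` are hypothetical, and the mapping assumes the site only uses the words One through Five.

```python
# Assumed set of rating words used as CSS class names on the site
STAR_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def star_to_int(word):
    """Return the numeric rating for a class word, or None if unknown."""
    return STAR_WORDS.get(word)


ratings = [star_to_int(word) for word in ["Three", "One", "Five", "Unknown"]]
print(ratings)
```

With the DataFrame built above, `df['Star Rating'].map(star_to_int)` would give a numeric column that can be sorted or averaged.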