# Web Scraping with Python

Web scraping extracts relevant data from web pages.
If you need data from a publicly accessible web page, web scraping makes the extraction convenient.
It does, however, require some basic knowledge of how HTML pages are structured.
In this lab, you will learn how to analyze the HTML code of a web page and extract the required information from it using web scraping in Python.

# Objectives
By the end of this lab, you will be able to:

Use the requests and BeautifulSoup libraries to extract the contents of a web page

Analyze the HTML code of a web page to find the relevant information

Extract the relevant information and save it in the required form

### Libraries required

pandas library for data storage and manipulation.

BeautifulSoup library for interpreting the HTML document.

requests library to communicate with the web page.

sqlite3 for creating the database instance.

In [2]:
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup 

### Initialization of known entities
You must declare a few entities at the beginning: the URL of the page, the CSV file name for saving the records, and the database name and table name for storing them. You also know which fields are to be saved, so you can create an empty dataframe with those columns. Additionally, since you require only the top 50 results, you need a loop counter initialized to 0.

In [10]:
url = 'https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films'
db_name = 'Movies.db'
table_name = 'Top_50'
csv_path = 'top_50_films.csv'
df = pd.DataFrame(columns=["Average Rank","Film","Year"])
count = 0

### Loading the web page for web scraping
To access the required information from the web page, you first need to load the entire web page as an HTML document in Python. The requests.get() function fetches the page, and the text attribute of its response holds the HTML as a string. You then parse that string with BeautifulSoup, using the 'html.parser' parser, to enable extraction of the relevant information.


In [11]:
html_page = requests.get(url).text
data = BeautifulSoup(html_page, 'html.parser')
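
Note that requests.get(url).text returns the response body even when the server answers with an error status, so a failed request can end up being parsed as if it were the real page. The following is a minimal sketch of a more defensive version, not part of the original lab. The helper names fetch_html and parse_html and the 30-second timeout are assumptions made for illustration. The demonstration at the end parses a small hand-written document so that it runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=30):
    """Download a page and return its HTML text (hypothetical helper).

    The timeout value is an assumption; raise_for_status() raises an
    exception for 4xx/5xx responses instead of returning an error page.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_html(html_text):
    """Parse HTML text with Python's built-in 'html.parser' (hypothetical helper)."""
    return BeautifulSoup(html_text, 'html.parser')


# Offline demonstration on a tiny hand-written document
sample = parse_html(
    "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
)
print(sample.title.string)
```

In the lab itself, the equivalent call would be parse_html(fetch_html(url)).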

### Scraping of required information
You now need to write the loop that extracts the appropriate information from the web page.
The rows of the required table can be accessed by calling the find_all() function on the BeautifulSoup object, as in the statements below.



In [12]:
tables = data.find_all('tbody')
rows = tables[0].find_all('tr')
# Here, the variable tables gets the body (tbody) of every table in the web page,
# and the variable rows gets all the rows of the first table.
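
To see what find_all() returns, it can help to try it on a small table first. The sketch below uses a hand-written HTML fragment (an assumption for illustration, not the real page) whose header row uses th cells and whose data rows use td cells.

```python
from bs4 import BeautifulSoup

# Hand-written sample table (hypothetical, not the real page)
sample_html = """
<table>
  <tbody>
    <tr><th>Average Rank</th><th>Film</th><th>Year</th></tr>
    <tr><td>1</td><td>The Godfather</td><td>1972</td></tr>
    <tr><td>2</td><td>Citizen Kane</td><td>1941</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
tables = soup.find_all('tbody')      # one entry per table body
rows = tables[0].find_all('tr')      # every row of the first table

# Count the td cells in each row; the header row has only th cells
cells_per_row = [len(row.find_all('td')) for row in rows]
print(len(tables), len(rows), cells_per_row)   # prints: 1 3 [0, 3, 3]
```

The header row yields an empty list of td cells, which is why a scraping loop usually has to skip rows where no td is found.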


In [13]:
for row in rows:
    if count<50:
        col = row.find_all('td')
        if len(col)!=0:
            data_dict = {"Average Rank": col[0].contents[0],
                         "Film": col[1].contents[0],
                         "Year": col[2].contents[0]}
            df1 = pd.DataFrame(data_dict, index=[0])
            df = pd.concat([df,df1], ignore_index=True)
            count+=1
    else:
        break

The loop above performs the following steps:

Iterate over the contents of the variable rows.

Check the loop counter to restrict the output to 50 entries.

Extract all the td data objects in the row and save them to col.

Check that the length of col is not 0, that is, that the current row actually contains data. This is important because tables often contain rows, such as header or merged rows, that hold no td cells even though this is not apparent from the page's appearance.

Create a dictionary data_dict whose keys match the columns of the dataframe created earlier and whose values come from the first three td cells of the row.

Convert the dictionary to a dataframe and concatenate it with the existing one. This way, the data keeps getting appended to the dataframe with every iteration of the loop.

Increment the loop counter.

Once the counter hits 50, stop iterating over rows and break the loop.
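
The same steps can also be written by collecting the rows in a plain list and building the dataframe once at the end, which avoids calling pd.concat() on every iteration. This is an alternative sketch, not the lab's own method. It runs on a hand-written sample table, the names limit, records, and films are hypothetical, and it uses get_text(strip=True) instead of contents[0], on the assumption that a cell may wrap its text in a nested tag such as a link.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written sample table (hypothetical, not the real page)
sample_html = """
<table><tbody>
  <tr><th>Average Rank</th><th>Film</th><th>Year</th></tr>
  <tr><td>1</td><td>The Godfather</td><td>1972</td></tr>
  <tr><td>2</td><td><a href="#">Citizen Kane</a></td><td>1941</td></tr>
  <tr><td>3</td><td>Casablanca</td><td>1942</td></tr>
</tbody></table>
"""
rows = BeautifulSoup(sample_html, 'html.parser').find_all('tr')

limit = 2      # stands in for the lab's limit of 50
records = []   # one dictionary per extracted row

for row in rows:
    if len(records) >= limit:
        break                      # enough rows collected
    col = row.find_all('td')
    if len(col) >= 3:              # skip header rows and short rows
        records.append({
            "Average Rank": col[0].get_text(strip=True),
            "Film": col[1].get_text(strip=True),
            "Year": col[2].get_text(strip=True),
        })

# Build the dataframe once, after the loop
films = pd.DataFrame(records, columns=["Average Rank", "Film", "Year"])
print(films)
```

Here the length of records doubles as the loop counter, so no separate count variable is needed.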

In [14]:
# View the scraped data
print(df)

   Average Rank                                           Film  Year
0             1                                  The Godfather  1972
1             2                                   Citizen Kane  1941
2             3                                     Casablanca  1942
3             4                         The Godfather, Part II  1974
4             5                            Singin' in the Rain  1952
5             6                                         Psycho  1960
6             7                                    Rear Window  1954
7             8                                 Apocalypse Now  1979
8             9                          2001: A Space Odyssey  1968
9            10                                  Seven Samurai  1954
10           11                                        Vertigo  1958
11           12                                    Sunset Blvd  1950
12           13                                   Modern Times  1936
13           14                   

In [16]:
df.to_csv(csv_path)
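
The lab also declares db_name and table_name and imports sqlite3 for storing the records in a database. The following is a minimal sketch of how that step might look, not the lab's own code. It uses a two-row sample dataframe and an in-memory database (both assumptions for illustration) so that it runs without creating files.

```python
import sqlite3
import pandas as pd

# Two sample rows standing in for the scraped dataframe (hypothetical data shape)
sample_df = pd.DataFrame({
    "Average Rank": ["1", "2"],
    "Film": ["The Godfather", "Citizen Kane"],
    "Year": ["1972", "1941"],
})

# ':memory:' is used here for illustration; the lab would pass db_name instead
conn = sqlite3.connect(':memory:')

# Write the dataframe to a table, replacing it if it already exists
sample_df.to_sql('Top_50', conn, if_exists='replace', index=False)

# Read the table back with a query to confirm the rows were stored
result = pd.read_sql('SELECT Film, Year FROM Top_50 ORDER BY Year', conn)
conn.close()
print(result)
```

With the lab's variables, the write step would be df.to_sql(table_name, conn, if_exists='replace', index=False) on a connection opened with sqlite3.connect(db_name).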