# Web Scraping and Extracting Data using REST APIs

### HTML structure
Hypertext markup language (HTML) serves as the foundation of web pages. Understanding its structure is crucial for web scraping.<br>

* `<html>` is the root element of an HTML page.<br>
* `<head>` contains meta-information about the HTML page.<br>
* `<body>` displays the content on the web page, often the data of interest.<br>
* `<h3>` tags are level-3 headings, rendered larger and bold by default. On a sports page, for instance, they might hold player names.<br>
* `<p>` tags represent paragraphs. On the same kind of page they might hold player salary information.<br>
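
To make the roles of these tags concrete, here is a minimal sketch that pulls the text out of `<h3>` and `<p>` tags. It uses the standard library's `html.parser` rather than `BeautifulSoup` so that it runs without installing anything. The page fragment, the player names, and the salaries are made-up sample values, and `HeadingAndParagraphCollector` is a hypothetical helper written only for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; names and salaries are made-up sample values.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h3>Player A</h3>
    <p>Salary: $100</p>
    <h3>Player B</h3>
    <p>Salary: $200</p>
  </body>
</html>
"""


class HeadingAndParagraphCollector(HTMLParser):
    """Collect the text found inside <h3> and <p> tags."""

    def __init__(self):
        super().__init__()
        self.current_tag = None   # tag whose text we are currently reading
        self.headings = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "p"):
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if self.current_tag == "h3":
            self.headings.append(text)
        elif self.current_tag == "p":
            self.paragraphs.append(text)


collector = HeadingAndParagraphCollector()
collector.feed(SAMPLE_HTML)
print(collector.headings)
print(collector.paragraphs)
```

The text inside `<title>` is ignored because it sits in `<head>`, not in one of the two tags being collected.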


### Composition of an HTML tag
HTML tags define the structure of web content and can contain attributes.<br>

* An HTML tag consists of an opening (start) tag and a closing (end) tag.<br>
* Each tag has a name, for example `a` in the anchor tag `<a>`.<br>
* Tags may contain attributes with an attribute name and value, providing additional information to the tag.<br>
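
The parts of a tag listed above can be seen by feeding a single anchor element to a parser. This is a rough sketch using the standard library's `html.parser`; the `TagInspector` class and the link text are assumptions made for the example, not part of the lab.

```python
from html.parser import HTMLParser


class TagInspector(HTMLParser):
    """Record each start tag, piece of text, and end tag as it is parsed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("text", data, {}))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, {}))


inspector = TagInspector()
# One anchor element: opening tag with an href attribute, content, closing tag
inspector.feed('<a href="https://www.ibm.com">IBM webpage</a>')

for event in inspector.events:
    print(event)
```

The first event carries the tag name `a` together with its attribute name `href` and the attribute value, the second carries the content, and the third marks the closing tag.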


### HTML document tree
You can visualize HTML documents as trees with tags as nodes.<br>

* A tag can contain strings and other tags; these are the tag's children.<br>
* Tags that share the same parent tag are siblings.<br>
* For example, the `<html>` tag contains both the `<head>` and `<body>` tags, making them children (and therefore descendants) of `<html>`. Because they share a parent, `<head>` and `<body>` are siblings.<br>

&emsp;&emsp;<img src="../Pictures/DOM_structure.png"/>
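
The parent, child, sibling, and descendant relationships can be checked on a small document. The sketch below uses `xml.etree.ElementTree` from the standard library, which only accepts well-formed markup; real web pages are often not well-formed and would normally need an HTML parser such as `BeautifulSoup`. The sample document is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Small, well-formed sample document (an assumption for this sketch)
SAMPLE = (
    "<html>"
    "<head><title>Page Title</title></head>"
    "<body><h3>Heading</h3><p>Paragraph</p></body>"
    "</html>"
)

root = ET.fromstring(SAMPLE)

# Children: tags directly inside <html>
children = [child.tag for child in root]

# Descendants: every tag below <html>, at any depth, in document order
descendants = [element.tag for element in root.iter() if element is not root]

# Siblings: the tags that share <body> as their parent
body = root.find("body")
body_children = [child.tag for child in body]

print("children of html:   ", children)
print("descendants of html:", descendants)
print("children of body:   ", body_children)
```

Every child is also a descendant, but `title`, `h3`, and `p` are descendants of `<html>` without being its children.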


### HTML tables
HTML tables are essential for presenting structured data.<br>

* Define an HTML table using the `<table>` tag.<br>
* Each table row is defined with a `<tr>` tag.<br>
* The first row often holds the column headings, with each heading cell defined by a `<th>` tag.<br>
* Each data cell in a row is defined with a `<td>` tag.<br>


&emsp;&emsp;<img src="../Pictures/HTML Tables.png"/>
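
As a rough illustration of how `<table>`, `<tr>`, `<th>`, and `<td>` fit together, the sketch below turns a small table into a list of rows. It relies on the standard library's `html.parser`; the table contents and the `TableParser` class are assumptions made for this example.

```python
from html.parser import HTMLParser

# Made-up sample table: one header row and two data rows
SAMPLE_TABLE = """
<table>
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1</td><td>First item</td></tr>
  <tr><td>2</td><td>Second item</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current_row = None   # cells of the row being read, if any
        self.in_cell = False      # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag in ("th", "td") and self.current_row is not None:
            self.in_cell = True
            self.current_row.append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False
        elif tag == "tr" and self.current_row is not None:
            self.rows.append(self.current_row)
            self.current_row = None
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.current_row[-1] += data.strip()


parser = TableParser()
parser.feed(SAMPLE_TABLE)

header, *body_rows = parser.rows
print(header)
for row in body_rows:
    print(row)
```

The first parsed row comes from the `<th>` cells and the remaining rows from the `<td>` cells, which is why scraping code often treats the first row separately.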



### Uniform resource locator (URL)

A uniform resource locator (URL) is the most common way to locate resources on the web. A URL can be broken into three parts.

<ul>
    <li><b>Scheme</b>: The protocol. In this lab it is <code>http://</code> or its secure variant <code>https://</code>.</li>
    <li><b>Internet address or base URL</b>: Identifies the server that hosts the resource, for example <code>www.ibm.com</code> or <code>www.gitlab.com</code>.</li>
    <li><b>Route</b>: The location of the resource on the web server, for example <code>/images/IDSNlogo.png</code>.</li>
</ul>
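
The three parts can be separated programmatically with `urllib.parse.urlsplit` from the standard library. The URL below is assembled from the example pieces listed above purely for illustration; it is not a verified address.

```python
from urllib.parse import urlsplit

# Illustrative URL built from the example parts above (not a verified address)
url = "http://www.ibm.com/images/IDSNlogo.png"

parts = urlsplit(url)

print("Scheme:  ", parts.scheme)   # the protocol
print("Base URL:", parts.netloc)   # the internet address
print("Route:   ", parts.path)     # the location on the web server
```

Note that `urlsplit` reports the scheme without the trailing `://` separator.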


### HTTP Methods

<ul>
    <li><b>GET</b> retrieves data from the server.</li>
    <li><b>POST</b> submits data to the server.</li>
    <li><b>PUT</b> updates data already on the server.</li>
    <li><b>DELETE</b> deletes data from the server.</li>
</ul>
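
The sketch below builds one request object per method without sending anything over the network, using `urllib.request.Request` from the standard library. The base address `http://www.example.com/items` and the form payloads are placeholders invented for the example. With the `requests` library, the corresponding calls are `requests.get`, `requests.post`, `requests.put`, and `requests.delete`.

```python
from urllib.request import Request

# Placeholder address, used only for illustration; nothing is sent
base = "http://www.example.com/items"

prepared = [
    Request(base, method="GET"),                               # retrieve data
    Request(base, data=b"name=widget", method="POST"),         # submit new data
    Request(base + "/1", data=b"name=gadget", method="PUT"),   # update existing data
    Request(base + "/1", method="DELETE"),                     # delete data
]

for request in prepared:
    print(request.get_method(), request.full_url)

methods = [request.get_method() for request in prepared]
```

Only the method name and, for POST and PUT, the request body differ between these requests; the URL structure stays the same.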

### Packages
`BeautifulSoup` library for interpreting the `HTML` document. Beautiful Soup represents HTML as a set of tree-like objects with methods for parsing and navigating the HTML.<br>

`requests` library to communicate with the web page. It allows you to send <code>HTTP/1.1</code> requests easily.<br>

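`sqlite3` for creating the database instance.<br>

`pandas` for holding the extracted rows in a DataFrame and writing them to the CSV file and the database.<br>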

In [5]:
# Requests Library Demo

import requests

url = 'https://www.ibm.com/'
r = requests.get(url)
# A status code of 200 means the request succeeded
print(r.status_code)


200


In [2]:
# Installing packages
!pip3 install beautifulsoup4
!pip3 install requests
# Check all installed packages and version
!pip3 freeze

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Downloading soupsieve-2.5-py3-none-any.whl.metadata (4.7 kB)
Downloading beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Downloading soupsieve-2.5-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.3 soupsieve-2.5
Collecting requests
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting charset-normalizer<4,>=2 (from requests)
  Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests)
  Downloading idna-3.7-py3-none-any.whl.metadata (9.9 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Downloading urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)
Downlo

### Lab Scenario
Suppose you have been hired by a multiplex management organization to extract information on the top 50 movies with the best average rating from the web page linked below.<br>
https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films<br>

The information required is `Average Rank`, `Film`, and `Year`.<br><br>
You are required to write a Python script `webscraping_movies.py` that extracts the information and saves it to a `CSV` file `top_50_films.csv`. You are also required to save the same information to a database `Movies.db` under the table name `Top_50`.


In [1]:
# webscraping_movies.py

import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup
import os 

url = 'https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films'
db_name = 'Movies.db'
table_name = 'Top_50'
csv_path = os.path.join(os.getcwd(),'top_50_films.csv')
df = pd.DataFrame(columns=["Average Rank","Film","Year"])
count = 0

In [3]:
# Load the web page and parse it for web scraping
html_page = requests.get(url).text
data = BeautifulSoup(html_page, 'html.parser')

In [4]:
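# Find every <tbody> on the page; the first one holds the film table
tables = data.find_all('tbody')
# Collect all rows (<tr>) of that table
rows = tables[0].find_all('tr')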

In [5]:
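# Walk the table rows and keep the first 50 that contain data cells
for row in rows:
    if count < 50:
        col = row.find_all('td')
        # Rows without <td> cells (such as a header row) are skipped
        if len(col) != 0:
            data_dict = {"Average Rank": col[0].contents[0],
                         "Film": col[1].contents[0],
                         "Year": col[2].contents[0]}
            # Wrap the row in a one-row DataFrame and append it to df
            df1 = pd.DataFrame(data_dict, index=[0])
            df = pd.concat([df, df1], ignore_index=True)
            count += 1
    else:
        break
print(df)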

   Average Rank                                           Film  Year
0             1                                  The Godfather  1972
1             2                                   Citizen Kane  1941
2             3                                     Casablanca  1942
3             4                         The Godfather, Part II  1974
4             5                            Singin' in the Rain  1952
5             6                                         Psycho  1960
6             7                                    Rear Window  1954
7             8                                 Apocalypse Now  1979
8             9                          2001: A Space Odyssey  1968
9            10                                  Seven Samurai  1954
10           11                                        Vertigo  1958
11           12                                    Sunset Blvd  1950
12           13                                   Modern Times  1936
13           14                   

In [6]:
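# Save the three required columns to CSV, without the DataFrame index
df.to_csv(csv_path, index=False)

# Save the same data to the Top_50 table, replacing it if it already exists
conn = sqlite3.connect(db_name)
df.to_sql(table_name, conn, if_exists='replace', index=False)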
conn.close()
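
Once the table is written, it can be queried with plain SQL to check the result. The sketch below is self-contained, so it does not open `Movies.db`; it builds an in-memory database with three rows taken from the output shown earlier. The column types in the `CREATE TABLE` statement are an assumption for this example, since the types in the real table depend on what `pandas` wrote.

```python
import sqlite3

# In-memory database standing in for Movies.db (an assumption for this sketch)
conn = sqlite3.connect(":memory:")

# Column names contain spaces, so they are quoted; the types are assumed
conn.execute(
    'CREATE TABLE Top_50 ("Average Rank" INTEGER, "Film" TEXT, "Year" INTEGER)'
)

# Three rows copied from the scraped output shown earlier
sample_rows = [
    (1, "The Godfather", 1972),
    (2, "Citizen Kane", 1941),
    (3, "Casablanca", 1942),
]
conn.executemany("INSERT INTO Top_50 VALUES (?, ?, ?)", sample_rows)
conn.commit()

# Read back the films released before 1950, ordered by rank
query = (
    'SELECT "Film", "Year" FROM Top_50 '
    'WHERE "Year" < 1950 ORDER BY "Average Rank"'
)
rows = conn.execute(query).fetchall()
conn.close()

for film, year in rows:
    print(film, year)
```

To run the same query against the real data, connect to `Movies.db` instead of `:memory:` and skip the table creation and insert steps.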