Objective: Find and extract data elements on a webpage and store the data into a dataset.

# Theory

1. We will need to install the packages and libraries for the web driver.
2. We will then need to review the webpage's `HTML` structure and determine the path to obtain the data we want to scrape.
3. Next, we will start scraping the data using Selenium and `pandas`.
4. Lastly, we will output the processed data into a `csv` file.

The notebook (`ipynb`) below runs on [Google Colab](https://colab.research.google.com/), which comes with many libraries such as `pandas` preinstalled. Colab also runs on Linux, so shell commands such as `apt-get`, which are not available on Windows, can be used to install the web driver.

[Link to the Colab Notebook](https://colab.research.google.com/drive/1BepAnTQLjTfFHE4dRd84Wg9j1t_onEnc?usp=sharing)

## Installing Packages and Libraries for the Web Driver

In [None]:
# Install the package.
!pip install selenium

# Import the required libraries.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

import pandas as pd

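# Install the Chrome web driver with apt.
!apt-get update
!apt install chromium-chromedriver

# Configure Chrome to run headless, since Colab has no display.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Pass the options through the `options` keyword. Recent Selenium 4 releases
# no longer accept the positional driver path or the `chrome_options` keyword.
driver = webdriver.Chrome(options=chrome_options)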

The web driver is a key component of Selenium. It is a browser automation framework that works with open source APIs: it accepts commands, sends those commands to a browser, and interacts with the web applications loaded there.

Selenium supports multiple web browsers and offers a web driver for each of them. In this notebook, the Chrome web driver is installed with `apt` and then started through Selenium. Alternatively, you can download the web driver for your specific browser and store it in a location where it can be easily accessed (for example, `C:\users\webdriver\chromedriver.exe`). You can download a web driver for your browser at [this site](https://selenium-python.readthedocs.io/installation.html#:~:text=Selenium%20requires%20a-,driver,-to%20interface%20with).
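
If you go the local-driver route, the setup might look like the sketch below. It assumes Selenium 4, where the driver path is passed through a `Service` object. The function name `build_local_chrome_driver` is hypothetical, and the imports sit inside the function only so that defining it does not require Selenium to be installed.

```python
def build_local_chrome_driver(driver_path, headless=True):
    """Start Chrome using a chromedriver binary stored at `driver_path`.

    Sketch only: assumes Selenium 4 and a chromedriver that matches
    the locally installed Chrome version.
    """
    # Imported here so the function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument('--headless')

    # Service wraps the path to the driver executable.
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)


# Example call (not executed here), using the path mentioned above:
# driver = build_local_chrome_driver(r'C:\users\webdriver\chromedriver.exe')
```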

## Overview of Selenium

We need to review some basic information on Selenium. The `HTML` content of the webpages will be parsed and scraped using Selenium. 

Selenium is a browser automation tool, used here through its **Python library**, that can scrape data from websites dynamically. It can also be used for web automation and testing; scraping data from the web is only a small part of what the Selenium library offers.

Some of the features of Selenium include:

* Multi-Browser Compatibility
* Multiple Language Support
* Handling of Dynamic Web Elements
* Easy to Identify Web Elements
* Speed and Performance
* Open Source and Portable

Learn more about Selenium [here](https://selenium-python.readthedocs.io/).

Selenium's web driver offers a variety of locator strategies for finding elements on the web page. For this project, we will be locating data elements using `XPath`. `XPath` is a query language for selecting elements in an `HTML` document by their tags and by attributes like `class`, `id`, and `name`.

The syntax of an `XPath` expression is shown below.

```python
# General form of an XPath expression:
#   //tagname[@attribute='value']
#
# //        ➡ search the whole document, at any depth
# tagname   ➡ tag name such as div, td, tr
# @         ➡ selects an attribute
# attribute ➡ attribute name (class, id, name, etc.)
# value     ➡ value of the attribute

## Examples ##

# Matches every div tag whose class is movie; Selenium returns the matches in a list.
xpath = "//div[@class='movie']"

# Matches every link with class drama directly inside a div tag that has a class of movie.
xpath = "//div[@class='movie']/a[@class='drama']"
```

```html
<!--HTML looks like this:-->

<div class='movie'>
    <a class='drama' href='IMDB.com'> IMDB</a>
</div>
```
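
To see the two example expressions in action without a browser, the sketch below runs them against the snippet above using Python's standard-library `xml.etree.ElementTree`, which supports a small subset of XPath. This is only an offline illustration: the wrapping `<body>` tag and the leading `.` in each path are additions needed by `ElementTree`, and Selenium itself is not involved.

```python
import xml.etree.ElementTree as ET

html = """
<div class='movie'>
    <a class='drama' href='IMDB.com'> IMDB</a>
</div>
"""

# Wrap the snippet in a single root element so it can be parsed.
root = ET.fromstring(f"<body>{html}</body>")

# Every div whose class attribute is exactly 'movie'.
movie_divs = root.findall(".//div[@class='movie']")

# Every a tag with class 'drama' that is a direct child of such a div.
drama_links = root.findall(".//div[@class='movie']/a[@class='drama']")

link_texts = [link.text.strip() for link in drama_links]
link_hrefs = [link.get("href") for link in drama_links]
print(len(movie_divs), link_texts, link_hrefs)
# prints: 1 ['IMDB'] ['IMDB.com']
```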

From the Box Office Mojo webpage reviewed in the next section, we will extract the following data elements for each movie.

* Rank
* Title
* Lifetime Gross
* Year


## Review Web Page's HTML Structure

We need to understand the structure and contents of the `HTML` tags within the web page. For this project, we will be using the [Box Office Mojo website](https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW) that contains the top 200 highest-grossing movies (shown below).

![](https://miro.medium.com/max/700/1*WvbDq6TcWHwThMDTp1JbrQ.png)

We can scrape this webpage by parsing the `HTML` of the page and extracting the information needed for our dataset. To inspect the `HTML`, right-click anywhere on the web page and choose Inspect to open the browser's developer tools. Then click the arrow icon in the upper left-hand corner of the developer tools panel and click the title name (Avatar) on the first line of the table. This will result in the following screen being displayed.

![](https://miro.medium.com/max/700/1*fVsRbMa6kp5emjE1uvnQAg.png)

In the `HTML` panel, the line for the title Avatar is highlighted (shown below).

```html
<a class="a-link-normal" href="/title/tt0499549/?ref_=bo_cso_table_1">Avatar</a>
```

`<a>` is the tag, and its class is `"a-link-normal"`. Avatar is the name of the movie that we want to extract.

If you move up one line from this tag, you will find the `td` tag with a class of `"a-text-left mojo-field-type-title"` (shown below). This is the parent of the `<a>` tag with class `"a-link-normal"`.

```html
<td class="a-text-left mojo-field-type-title" style="width: 616px; height: 34px; min-width: 616px; min-height: 34px;">
```

So, to find, extract, and capture all the movie title names on the web page, you would follow these steps.

1. Find every parent element (`td` tag with a class of `"a-text-left mojo-field-type-title"`).
2. Within each parent from step 1, find every link (`a` tag with a class of `"a-link-normal"`).
3. Extract the text of each link and build a list containing the movie title names.
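
These three steps can be tried offline before touching the live page. The sketch below uses Python's standard-library `xml.etree.ElementTree`, which understands a small subset of XPath, on a hand-written table fragment. The fragment is an assumption that mimics the structure described above (styles and `href` values are omitted); it is not the real page source, and Selenium is not involved.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for two rows of the chart (hypothetical, simplified).
fragment = """
<table>
  <tr>
    <td class="a-text-left mojo-field-type-title">
      <a class="a-link-normal">Avatar</a>
    </td>
  </tr>
  <tr>
    <td class="a-text-left mojo-field-type-title">
      <a class="a-link-normal">Avengers: Endgame</a>
    </td>
  </tr>
</table>
"""

root = ET.fromstring(fragment)

# Step 1: find every parent td with the title class.
title_cells = root.findall(".//td[@class='a-text-left mojo-field-type-title']")

# Step 2: within each parent, find the a tags with class a-link-normal.
title_links = []
for cell in title_cells:
    title_links.extend(cell.findall("./a[@class='a-link-normal']"))

# Step 3: extract the text and build the list of titles.
movie_titles = [link.text for link in title_links]
print(movie_titles)
# prints: ['Avatar', 'Avengers: Endgame']
```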

The code for finding, extracting, and capturing movie title names on the live page is shown in step 2 of Find and Extract Data Elements below.

# Performing Web Scrape

Below are the steps to scrape the content from a website using Selenium's `driver` and convert the contents into a `csv` file using `pandas`.

In summary, the steps are:

1. Open the URL of the website.
2. Extract content using Selenium's `driver.find_elements()` function.
3. Read each element's `.text` and append it to a list.
4. Combine all the lists into one list of rows using Python's `zip()` function.
5. Convert the combined list into a `pandas` dataframe.
6. Export the dataframe to a `csv` file for future use.

## Find and Extract Data Elements

For each of the data elements we want to extract, we will find all the `HTML` lines that are within a specific tag and class. We will then extract the data elements and store the data in a list.

The `get` command opens the specified URL in the browser controlled by your web driver.

In [33]:
driver.get('https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW')

1. Find, extract and capture **Movie Rankings** in a list:

In [34]:
# Get elements from HTML
all_rankings = driver.find_elements(By.XPATH, "(//td[@class = 'a-text-right mojo-header-column mojo-truncate mojo-field-type-rank'])")

# Create list
movie_rank_list = []

# Convert elements and save in list
for ranking in all_rankings:
  movie_rank_list.append(ranking.text)

print(movie_rank_list)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '

2. Find, extract and capture **Movie Titles** in a list:

In [35]:
# Get elements from HTML
all_titles = driver.find_elements(By.XPATH, "(//td[@class = 'a-text-left mojo-field-type-title'])/a[@class = 'a-link-normal']")

# Create list
movie_title_list = []

# Convert elements and save in list
for title in all_titles:
  movie_title_list.append(title.text)

print(movie_title_list)

['Avatar', 'Avengers: Endgame', 'Titanic', 'Star Wars: Episode VII - The Force Awakens', 'Avengers: Infinity War', 'Spider-Man: No Way Home', 'Jurassic World', 'The Lion King', 'The Avengers', 'Furious 7', 'Frozen II', 'Avengers: Age of Ultron', 'Black Panther', 'Harry Potter and the Deathly Hallows: Part 2', 'Star Wars: Episode VIII - The Last Jedi', 'Jurassic World: Fallen Kingdom', 'Frozen', 'Beauty and the Beast', 'Incredibles 2', 'The Fate of the Furious', 'Iron Man 3', 'Minions', 'Captain America: Civil War', 'Aquaman', 'The Lord of the Rings: The Return of the King', 'Spider-Man: Far from Home', 'Captain Marvel', 'Transformers: Dark of the Moon', 'Skyfall', 'Transformers: Age of Extinction', 'Jurassic Park', 'The Dark Knight Rises', 'Joker', 'Star Wars: Episode IX - The Rise of Skywalker', 'Toy Story 4', 'Toy Story 3', "Pirates of the Caribbean: Dead Man's Chest", 'The Lion King', 'Rogue One: A Star Wars Story', 'Aladdin', 'Pirates of the Caribbean: On Stranger Tides', 'Despicab

3. Find, extract and capture **Movie Release Years** in a list:

In [36]:
# Get elements from HTML
all_releaseyrs = driver.find_elements(By.XPATH, "(//td[@class = 'a-text-left mojo-field-type-year']/a[@class = 'a-link-normal'])")

# Create list
movie_release_years = []

# Convert elements and save in list
for year in all_releaseyrs:
  movie_release_years.append(year.text)

print(movie_release_years)

['2009', '2019', '1997', '2015', '2018', '2021', '2015', '2019', '2012', '2015', '2019', '2015', '2018', '2011', '2017', '2018', '2013', '2017', '2018', '2017', '2013', '2015', '2016', '2018', '2003', '2019', '2019', '2011', '2012', '2014', '1993', '2012', '2019', '2019', '2019', '2010', '2006', '1994', '2016', '2019', '2011', '2017', '2016', '1999', '2010', '2016', '2001', '2012', '2008', '2010', '2013', '2016', '2017', '2014', '2007', '2013', '2002', '2007', '2003', '2009', '2004', '2018', '2021', '2001', '2005', '2007', '2009', '2022', '2015', '2017', '2002', '2012', '2016', '2016', '2017', '2005', '2013', '2017', '2015', '2018', '2017', '2010', '2009', '2012', '2002', '2017', '2021', '1996', '2016', '2007', '2017', '2019', '2004', '2017', '1982', '2018', '2009', '2008', '2004', '2013', '2018', '2016', '1977', '2021', '2014', '2022', '2006', '2019', '2014', '2012', '2014', '2010', '2012', '2016', '2014', '2005', '2013', '2003', '2009', '2019', '2021', '2013', '2014', '2011', '2009',

4. Find, extract and capture **Lifetime Gross Earnings** in a list:

In [37]:
# Get elements from HTML
all_earnings = driver.find_elements(By.XPATH, "(//td[@class = 'a-text-right mojo-field-type-money'])")

# Create list
movie_lifetime_gross_earnings = []

# Convert elements and save in list
for earning in all_earnings:
  movie_lifetime_gross_earnings.append(earning.text)

print(movie_lifetime_gross_earnings)

['$2,847,397,339', '$2,797,501,328', '$2,201,647,264', '$2,069,521,700', '$2,048,359,754', '$1,892,761,122', '$1,671,537,444', '$1,662,899,439', '$1,518,815,515', '$1,515,341,399', '$1,450,026,933', '$1,402,809,540', '$1,347,597,973', '$1,342,359,942', '$1,332,698,830', '$1,310,466,296', '$1,281,508,100', '$1,273,576,220', '$1,243,089,244', '$1,236,005,118', '$1,214,811,252', '$1,159,444,662', '$1,153,337,496', '$1,148,528,393', '$1,146,436,214', '$1,131,927,996', '$1,128,462,972', '$1,123,794,079', '$1,108,569,499', '$1,104,054,072', '$1,099,699,003', '$1,081,153,097', '$1,074,445,730', '$1,074,149,279', '$1,073,394,593', '$1,066,970,811', '$1,066,179,747', '$1,063,611,805', '$1,056,057,720', '$1,050,693,953', '$1,045,713,802', '$1,034,800,131', '$1,028,570,942', '$1,027,082,707', '$1,025,468,216', '$1,024,121,104', '$1,017,879,803', '$1,017,030,651', '$1,006,102,277', '$977,070,383', '$970,766,005', '$966,554,929', '$962,542,945', '$962,201,338', '$960,996,492', '$959,027,992', '$947
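
The four cells above repeat one pattern: locate elements by XPath, read `.text`, and collect the results. A possible way to fold that into a helper is sketched below. The name `texts_by_xpath` is hypothetical; the function only assumes that `driver` exposes `find_elements(by, value)` as Selenium's web driver does. The locator strategy is passed as the string `'xpath'`, which is the value of `By.XPATH`.

```python
def texts_by_xpath(driver, xpath):
    """Return the text of every element matched by `xpath`.

    `driver` is expected to behave like a Selenium web driver, i.e. to
    provide find_elements(by, value) returning objects with a .text attribute.
    """
    # 'xpath' is the string value behind Selenium's By.XPATH constant.
    elements = driver.find_elements('xpath', xpath)
    return [element.text for element in elements]


# Example calls (not executed here), reusing the XPath expressions above:
# movie_rank_list = texts_by_xpath(driver, "//td[@class = 'a-text-right mojo-header-column mojo-truncate mojo-field-type-rank']")
# movie_title_list = texts_by_xpath(driver, "//td[@class = 'a-text-left mojo-field-type-title']/a[@class = 'a-link-normal']")
```

Keeping the XPath strings as arguments leaves the page-specific knowledge in one visible place, which helps if the site's class names change.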

## Create and Display Data Frame

We will use the `zip()` function to merge all the lists into a single list of rows, which gives us a 2D structure with one tuple per movie.

More info on `zip()` can be found [here](https://www.w3schools.com/python/ref_func_zip.asp).
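
As a small illustration of how `zip()` pairs the lists up, the sketch below uses only the first three values of each list, copied from the outputs above. One detail worth knowing: `zip()` stops at the shortest input, so if one XPath matched fewer elements than the others, rows would be dropped silently. The length check is an optional safeguard, not part of the original notebook.

```python
ranks = ['1', '2', '3']
titles = ['Avatar', 'Avengers: Endgame', 'Titanic']
years = ['2009', '2019', '1997']
earnings = ['$2,847,397,339', '$2,797,501,328', '$2,201,647,264']

# Optional safeguard: every list should describe the same number of movies.
lengths = {len(column) for column in (ranks, titles, years, earnings)}
assert len(lengths) == 1, f"column lengths differ: {lengths}"

# zip() yields one tuple per movie, taking the i-th item from each list.
rows = list(zip(ranks, titles, years, earnings))
print(rows[0])
# prints: ('1', 'Avatar', '2009', '$2,847,397,339')
```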

In [38]:
# Combine Lists
data = list(zip(movie_rank_list, movie_title_list, movie_release_years, movie_lifetime_gross_earnings))

# Create the Data Frame
df = pd.DataFrame(data, columns=['Rank', 'Movie Name', 'Release Date','Lifetime Earnings'])

# Display the first 10 rows of the Data Frame
df.head(10)

|   | Rank | Movie Name | Release Date | Lifetime Earnings |
|---|------|------------|--------------|-------------------|
| 0 | 1 | Avatar | 2009 | $2,847,397,339 |
| 1 | 2 | Avengers: Endgame | 2019 | $2,797,501,328 |
| 2 | 3 | Titanic | 1997 | $2,201,647,264 |
| 3 | 4 | Star Wars: Episode VII - The Force Awakens | 2015 | $2,069,521,700 |
| 4 | 5 | Avengers: Infinity War | 2018 | $2,048,359,754 |
| 5 | 6 | Spider-Man: No Way Home | 2021 | $1,892,761,122 |
| 6 | 7 | Jurassic World | 2015 | $1,671,537,444 |
| 7 | 8 | The Lion King | 2019 | $1,662,899,439 |
| 8 | 9 | The Avengers | 2012 | $1,518,815,515 |
| 9 | 10 | Furious 7 | 2015 | $1,515,341,399 |
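
The `Lifetime Earnings` values are still strings such as `$2,847,397,339`, so they cannot be sorted or summed as numbers yet. If numeric values are wanted, one possible conversion is sketched below in plain Python. The helper name `parse_money` is hypothetical, and this step is not part of the original notebook.

```python
def parse_money(text):
    """Convert a string like '$2,847,397,339' into the integer 2847397339."""
    return int(text.replace('$', '').replace(',', ''))


# First three values from the table above.
sample = ['$2,847,397,339', '$2,797,501,328', '$2,201,647,264']
numeric = [parse_money(value) for value in sample]
print(numeric)
# prints: [2847397339, 2797501328, 2201647264]
```

With the data frame from this section, the same helper could be applied through `df['Lifetime Earnings'].map(parse_money)`.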


## Convert Data Frame to `csv` File

If needed, we can create a `csv` file from the data frame that was created in the previous step.

In [50]:
df.to_csv('Top_200_Movies_with_Lifetime_Gross.csv', index = False)

We can also read the `csv` file back using `pandas`.

In [52]:
pd.read_csv('Top_200_Movies_with_Lifetime_Gross.csv', lineterminator = '\n', index_col = 0)

| Rank | Movie Name | Release Date | Lifetime Earnings |
|------|------------|--------------|-------------------|
| 1 | Avatar | 2009 | $2,847,397,339 |
| 2 | Avengers: Endgame | 2019 | $2,797,501,328 |
| 3 | Titanic | 1997 | $2,201,647,264 |
| 4 | Star Wars: Episode VII - The Force Awakens | 2015 | $2,069,521,700 |
| 5 | Avengers: Infinity War | 2018 | $2,048,359,754 |
| ... | ... | ... | ... |
| 196 | X-Men: Apocalypse | 2016 | $543,934,105 |
| 197 | Sherlock Holmes: A Game of Shadows | 2011 | $543,848,418 |
| 198 | Despicable Me | 2010 | $543,157,985 |
| 199 | Cinderella | 2015 | $542,358,331 |
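
The earnings values contain commas, which is why they appear in double quotes inside the `csv` file. The sketch below uses Python's standard-library `csv` module and an in-memory buffer (instead of `pandas` and a file on disk, purely to keep the example self-contained) to show that such a field survives a write and read round trip.

```python
import csv
import io

header = ['Rank', 'Movie Name', 'Release Date', 'Lifetime Earnings']
row = ['1', 'Avatar', '2009', '$2,847,397,339']

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(header)
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back splits on the separator commas only, not the quoted ones.
rows_read = list(csv.reader(io.StringIO(csv_text)))
print(rows_read[1])
# prints: ['1', 'Avatar', '2009', '$2,847,397,339']
```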
