# Web Scraping Using Selenium and Python

### What is Selenium?

Selenium is a popular open-source tool for automating web browsers. It allows users to write scripts in Python (among other languages) to automate tasks on the web, such as filling out forms, clicking buttons, and extracting data from websites.

### Why Selenium?

1. Easy to read and write.
2. Open source.
3. Fast to execute.
4. Runs the same script across different browsers.
5. Automates browsers easily.
6. Beginner friendly.

### Installation

To use Selenium in Python, you will first need to install the Selenium package using the following command:

In [2]:
#!pip install selenium

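You also need a ChromeDriver binary that matches your installed version of Chrome. Download links are listed here:
https://selenium-python.readthedocs.io/installation.html#drivers
Selenium 4.6 and later can also find or download a matching driver automatically through Selenium Manager.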
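Several Selenium calls changed between major versions, so it can help to confirm which version is installed before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the Selenium version, or None if the package is not installed
print("selenium:", installed_version("selenium"))
```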

### Basic Code

In [19]:
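from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to the downloaded chromedriver binary
DRIVER_PATH = '/path/to/chromedriver'
# Selenium 4 takes the driver path through a Service object
browser = webdriver.Chrome(service=Service(DRIVER_PATH))

# Open a page, then close the browser and end the session
browser.get('https://www.google.com')

browser.quit()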

## Key Components of Selenium

1. Element Locators
2. Web Interaction & Navigation
3. Waits

### Element Locators

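WebDriver provides two main methods for finding elements:

1. find_element returns the first matching element and raises NoSuchElementException if nothing matches.
2. find_elements returns a list of all matching elements, which is empty if nothing matches.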

|         Type         |                                   Description                                  |          DOM Sample         |                   Example                  |
|:--------------------:|:------------------------------------------------------------------------------:|:---------------------------:|:------------------------------------------:|
| By.ID                | Searches for elements based on their HTML ID                                   |       <div id="myID">       | find_element(By.ID, "myID")                |
| By.NAME              | Searches for elements based on their name attribute                            |    <input name="myNAME">    | find_element(By.NAME, "myNAME")            |
| By.XPATH             | Searches for elements based on an XPath expression                             | <span>My <a>Link</a></span> | find_element(By.XPATH, "//span/a")         |
| By.LINK_TEXT         | Searches for anchor elements based on a match of their text content            |        <a>My Link</a>       | find_element(By.LINK_TEXT, "My Link")      |
| By.PARTIAL_LINK_TEXT | Searches for anchor elements based on a sub-string match of their text content |        <a>My Link</a>       | find_element(By.PARTIAL_LINK_TEXT, "Link") |
| By.TAG_NAME          | Searches for elements based on their tag name                                  |             <h1>            | find_element(By.TAG_NAME, "h1")            |
| By.CLASS_NAME        | Searches for elements based on their HTML classes                              |    <div class="myCLASS">    | find_element(By.CLASS_NAME, "myCLASS")     |
| By.CSS_SELECTOR      | Searches for elements based on a CSS selector                                  | <span>My <a>Link</a></span> | find_element(By.CSS_SELECTOR, "span > a")  |

A full description of the methods can be found here: https://selenium-python.readthedocs.io/locating-elements.html
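Because find_elements returns an empty list when nothing matches, it can be used to look up optional elements without handling an exception. The sketch below assumes `driver` is any WebDriver instance; the helper name `first_or_none` is made up for this example and is not part of Selenium.

```python
def first_or_none(driver, by, value):
    """Return the first element matching the locator, or None when nothing matches.

    Relies on find_elements returning an empty list instead of raising.
    """
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None
```

With a real browser session this could be called as `first_or_none(browser, By.ID, "myID")`, where `By` comes from `selenium.webdriver.common.by`.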

### Web Interaction & Navigation

In [20]:
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium import webdriver

# Open a web browser and navigate to the specified URL
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
browser.get('https://www.google.com/')

# Locate the search box (absolute XPaths like this one break when the page layout changes)
element = browser.find_element(By.XPATH, "/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input")
#element = browser.find_element(By.XPATH, "//input[@type='text']")
element.send_keys("some text")  # Type text into the element
element.send_keys(Keys.ENTER)   # Press ENTER

# Close the browser (uncomment when you no longer need the window)
#browser.quit()

### Waits

1. Explicit Wait

In [24]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium import webdriver

# Open a web browser and navigate to the specified URL
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
browser.get('https://www.google.com/')

try:
    # Wait up to 10 seconds for the element to appear in the DOM
    element = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input"))
    )
    element.send_keys("some text")  # Type text into the element
    element.send_keys(Keys.ENTER)   # Press ENTER
    # Scroll to the bottom of the page
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
finally:
    # Always close the browser, even if the wait times out
    browser.quit()

2. Implicit Wait

In [None]:
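# driver is an existing WebDriver instance.
# Every later find_element / find_elements call keeps retrying for up to 10 seconds before giving up.
driver.implicitly_wait(10)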
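Both kinds of wait rest on the same idea: keep re-checking a condition until it holds or a timeout expires. The sketch below illustrates that polling loop in plain Python. It is not Selenium's implementation; the name `wait_until` and its parameters are made up for this example, and it raises the built-in `TimeoutError` where Selenium raises its own `TimeoutException`.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in condition that only succeeds on its third call
attempts = []


def ready_on_third_call():
    attempts.append(1)
    return "ready" if len(attempts) >= 3 else None


print(wait_until(ready_on_third_call, timeout=2, poll_interval=0.01))  # prints "ready"
```

An explicit wait applies this loop to one specific condition, while an implicit wait applies a similar retry to every element lookup on the driver.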

# Important Code Snippets 

Scrolling to the bottom of the page

In [None]:
# Jump straight to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Scroll gradually instead: move down one pixel at a time until the end of the page
page_height = driver.execute_script("return document.body.scrollHeight")
for value in range(0, page_height):
    driver.execute_script(f"window.scrollTo(0, {value});")
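The pixel-by-pixel loop above sends one script call per pixel, which can be slow on long pages. A possible variation is to scroll in larger steps with a short pause between them. This is a sketch under assumptions: `driver` is any WebDriver instance, and the helper names, the 200-pixel step, and the 0.05-second pause are arbitrary choices for illustration.

```python
import time


def scroll_offsets(page_height, step=200):
    """Return the y-offsets to visit, always ending exactly at page_height."""
    offsets = list(range(0, page_height, step))
    offsets.append(page_height)
    return offsets


def smooth_scroll(driver, step=200, pause=0.05):
    """Scroll down in `step`-pixel increments, pausing briefly between jumps."""
    page_height = driver.execute_script("return document.body.scrollHeight")
    for offset in scroll_offsets(page_height, step):
        driver.execute_script(f"window.scrollTo(0, {offset});")
        time.sleep(pause)


print(scroll_offsets(1000, 400))  # prints [0, 400, 800, 1000]
```

Pages that load more content while scrolling may grow taller during the loop, so the height would need to be re-read in that case.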

Running Selenium in the background (headless mode)

In [None]:
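from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--headless')  # Run Chrome without a visible window
PATH = 'WEBDRIVE_PATH'  # Placeholder: path to the chromedriver binary
driver = webdriver.Chrome(service=Service(PATH), options=chrome_options)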

# Mini Project

In [4]:
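from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import pandas as pd

In [5]:
# Start Chrome with a local chromedriver and open the worldwide lifetime gross chart
driver = webdriver.Chrome(service=Service('chromedriver.exe'))
driver.get('https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW')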

In [6]:
movies_names = driver.find_elements(By.XPATH, '//td[@class="a-text-left mojo-field-type-title"]/a[@class="a-link-normal"]')

In [7]:
#movies_names = driver.find_elements(By.XPATH, '//*[@id="table"]/div/table[2]/tbody/tr[5]/td[2]')
# Collect the visible text of every title link
movie_name_list = [movie.text for movie in movies_names]
print(movie_name_list)  # Prints all the movie names as a list

['Avatar', 'Avengers: Endgame', 'Titanic', 'Star Wars: Episode VII - The Force Awakens', 'Avengers: Infinity War', 'Spider-Man: No Way Home', 'Jurassic World', 'The Lion King', 'The Avengers', 'Furious 7', 'Top Gun: Maverick', 'Frozen II', 'Avengers: Age of Ultron', 'Black Panther', 'Harry Potter and the Deathly Hallows: Part 2', 'Star Wars: Episode VIII - The Last Jedi', 'Jurassic World: Fallen Kingdom', 'Beauty and the Beast', 'Frozen', 'Incredibles 2', 'The Fate of the Furious', 'Iron Man 3', 'Minions', 'Captain America: Civil War', 'Aquaman', 'The Lord of the Rings: The Return of the King', 'Spider-Man: Far from Home', 'Captain Marvel', 'Transformers: Dark of the Moon', 'Jurassic Park', 'Skyfall', 'Transformers: Age of Extinction', 'The Dark Knight Rises', 'Joker', 'Star Wars: Episode IX - The Rise of Skywalker', 'Toy Story 4', 'Toy Story 3', "Pirates of the Caribbean: Dead Man's Chest", 'The Lion King', 'Rogue One: A Star Wars Story', 'Aladdin', 'Pirates of the Caribbean: On Stran

In [8]:
release_year = driver.find_elements(By.XPATH, '//td[@class="a-text-left mojo-field-type-year"]/a[@class="a-link-normal"]')

In [9]:
# Collect the visible text of every release-year link
release_year_list = [year.text for year in release_year]
print(release_year_list)

['2009', '2019', '1997', '2015', '2018', '2021', '2015', '2019', '2012', '2015', '2022', '2019', '2015', '2018', '2011', '2017', '2018', '2017', '2013', '2018', '2017', '2013', '2015', '2016', '2018', '2003', '2019', '2019', '2011', '1993', '2012', '2014', '2012', '2019', '2019', '2019', '2010', '2006', '1994', '2016', '2019', '2011', '2016', '2017', '2016', '1999', '2010', '2001', '2012', '2008', '2022', '2017', '2010', '2013', '2016', '2014', '2007', '2013', '2022', '2002', '2007', '2003', '2022', '2009', '2004', '2002', '2018', '2021', '2001', '2005', '2007', '2016', '2009', '2015', '2017', '2012', '2016', '2017', '2005', '2013', '2017', '2015', '2018', '2017', '2012', '2010', '2009', '2002', '2017', '2021', '1996', '2016', '2007', '2017', '2019', '2004', '2017', '1982', '2018', '2009', '2008', '2004', '2013', '2018', '2016', '1977', '2021', '2014', '2014', '2022', '2022', '2022', '2019', '2006', '2014', '2012', '2014', '2010', '2012', '2016', '2014', '2005', '2013', '2003', '2009',

In [10]:
# Select every lifetime-gross cell in the table
lifetime_gross = driver.find_elements(By.XPATH, '//td[@class="a-text-right mojo-field-type-money"]')
lifetime_gross_list = [gross.text for gross in lifetime_gross]
print(lifetime_gross_list)

['$2,922,917,914', '$2,797,501,328', '$2,201,647,264', '$2,069,521,700', '$2,048,359,754', '$1,916,306,995', '$1,671,537,444', '$1,663,250,487', '$1,518,815,515', '$1,515,341,399', '$1,488,519,000', '$1,450,026,933', '$1,402,809,540', '$1,382,248,826', '$1,342,359,942', '$1,332,698,830', '$1,310,466,296', '$1,305,611,599', '$1,304,550,716', '$1,243,089,244', '$1,236,005,118', '$1,214,811,252', '$1,159,444,662', '$1,153,337,496', '$1,148,528,393', '$1,146,436,214', '$1,131,927,996', '$1,128,462,972', '$1,123,794,079', '$1,109,802,321', '$1,108,569,499', '$1,104,054,072', '$1,081,169,825', '$1,074,458,282', '$1,074,149,279', '$1,073,394,593', '$1,066,970,811', '$1,066,179,747', '$1,063,611,805', '$1,058,682,142', '$1,050,693,953', '$1,045,713,802', '$1,042,533,689', '$1,034,800,131', '$1,028,570,942', '$1,027,082,707', '$1,025,468,216', '$1,023,842,938', '$1,017,030,651', '$1,006,234,167', '$1,001,136,080', '$995,339,117', '$977,070,383', '$970,766,005', '$966,554,929', '$962,201,338', '
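The gross figures are scraped as strings with a dollar sign and thousands separators, so they cannot be sorted or summed as numbers yet. One possible cleanup step is sketched below; the helper name `parse_gross` is made up for this example, and the sample values are taken from the output above.

```python
def parse_gross(text):
    """Convert a string such as '$2,922,917,914' to an integer number of dollars."""
    return int(text.replace("$", "").replace(",", "").strip())


sample = ['$2,922,917,914', '$2,797,501,328', '$995,339,117']
print([parse_gross(value) for value in sample])
# prints [2922917914, 2797501328, 995339117]
```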

In [11]:
# Combine the three lists row by row and save them as a CSV file
data = list(zip(movie_name_list, release_year_list, lifetime_gross_list))
df = pd.DataFrame(data, columns=['Movie Name', 'Release Year', 'Lifetime Earnings'])
df.to_csv('top_200_movies_with_lifetime_gross.csv', index=False)
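One thing to watch for: `zip` stops at the shortest input, so if one XPath matched fewer cells than the others, rows would be dropped or misaligned without any error. A small guard before building the DataFrame can catch that. This is a sketch; the helper name `build_rows` is made up for this example, and the sample values come from the outputs above.

```python
def build_rows(names, years, grosses):
    """Zip the three scraped lists into rows, refusing to continue if their lengths differ."""
    if not (len(names) == len(years) == len(grosses)):
        raise ValueError(
            f"length mismatch: {len(names)} names, {len(years)} years, {len(grosses)} grosses"
        )
    return list(zip(names, years, grosses))


names = ['Avatar', 'Avengers: Endgame', 'Titanic']
years = ['2009', '2019', '1997']
grosses = ['$2,922,917,914', '$2,797,501,328', '$2,201,647,264']

rows = build_rows(names, years, grosses)
print(rows[0])  # prints ('Avatar', '2009', '$2,922,917,914')
```

The returned rows can be passed to `pd.DataFrame` in place of the plain `zip` result.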