<div id="header"><p style="color:#6a66bd; text-align:center; font-weight:bold; font-family:verdana; font-size:25px;">Web Scraping with Python</p></div>

---

<p style="text-align:right; font-family:verdana;">Follow <a href="https://github.com/TheMrityunjayPathak" style="color:#6a66bd; text-decoration:none;">@Mrityunjay Pathak</a> for more!</p>

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Web Scraping</font>
<br>
• Web scraping is the process of extracting information from websites. 
<br>
• Two popular tools for this task are Selenium and BeautifulSoup. 
<br>
• Each has its own strengths, and the two are often used together: Selenium loads the page in a browser and BeautifulSoup parses the resulting HTML.
<br>
<br>
<font color='#6a66bd' size="5px">BeautifulSoup</font>
<br>
• BeautifulSoup is a library for parsing HTML and XML documents. 
<br>
• It helps to navigate the HTML structure, search for elements, and extract data. 
<br>
• It is particularly effective for handling and cleaning up the HTML after fetching it, making it easier to extract the desired information.
<br>
<br>
<font color='#6a66bd' size="5px">Selenium</font>
<br>
• Selenium is a powerful tool for automating web browsers. 
<br>
• It can simulate user interactions, such as clicking buttons, filling out forms, and scrolling through pages. 
<br>
• This makes it especially useful for scraping dynamic content that is loaded via JavaScript or requires user actions.
</div>

---

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<font color='#6a66bd' size="5px">Libraries Used</font>
<br>
<br>
<strong>pandas</strong>
<br>
• This library is used for data manipulation and analysis. 
<br>
• It provides powerful data structures like DataFrames which are great for organizing and analyzing data scraped from websites. 
<br>
• For example, after scraping data, you can use pandas to clean it and save it in formats such as CSV or Excel.
<br>
<br>
<strong>requests</strong>
<br>
• This library allows you to send HTTP requests using Python. 
<br>
• It's often used to fetch the HTML content of a webpage. 
<br>
• With requests, you can easily retrieve the page source which can then be parsed to extract the desired information.
<br>
<br>
<strong>BeautifulSoup</strong>
<br>
• This library is used for parsing HTML and XML documents.
<br>
• It makes it easy to navigate and search the HTML structure of a webpage. 
<br>
• After retrieving the HTML content with requests, you can use BeautifulSoup to parse it and extract specific elements of the webpage.
<br>
<br>
<strong>selenium</strong>
<br>
• This library is used for automating web browsers.
<br>
• It allows you to interact with web pages, which is particularly useful for scraping dynamic content that is loaded via JavaScript. 
<br>
• Selenium can simulate user interactions such as clicking buttons, filling out forms and scrolling.
<br>
<br>
<strong>webdriver_manager</strong>
<br>
• This library is used to automatically manage browser drivers for Selenium.
<br>
• Instead of manually downloading and setting up browser drivers, webdriver_manager handles the installation and setup for you.
<br>
• This simplifies the process of ensuring you have the correct driver version for your browser.
</div>
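The parsing step described above can be sketched on a small inline HTML string, so it runs without network access. The markup, the class names `title` and `score`, and the variable `sample_html` are made up for illustration; in a real run the string would typically come from `requests.get(url).text` or from Selenium's `page_source`. The built-in `html.parser` is used here so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page
sample_html = """
<html><body>
  <h1>Sample Chart</h1>
  <ul>
    <li><h3 class="title">1. First Show</h3><span class="score">9.1</span></li>
    <li><h3 class="title">2. Second Show</h3><span class="score">8.7</span></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag; find_all() returns every match
heading = soup.find("h1").text
# A string as the second argument is treated as a CSS class filter
titles = [tag.text for tag in soup.find_all("h3", "title")]
# class_ is the explicit keyword form of the same filter
scores = [tag.text for tag in soup.find_all("span", class_="score")]

print(heading)
print(titles)
print(scores)
```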

<a href="https://www.imdb.com/chart/toptv/" text-decoration="none" color="black"><img src="https://raw.githubusercontent.com/TheMrityunjayPathak/Data-Science-with-Python/main/Web%20Scrapping/images/home.png" border="2px solid black"></a>

In [1]:
import pandas as pd
# requests
import requests
# BeautifulSoup
from bs4 import BeautifulSoup
# selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

In [2]:
# Standard Library
import time

# Setting Up Selenium WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

try:
    # Navigating to a Web Page
    driver.get("https://www.imdb.com/chart/toptv/")

    # Scrolling to the Bottom of the Page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Adding a Time Gap for Loading Content
    time.sleep(10)

    # Saving the Final HTML
    html = driver.page_source
finally:
    # Closing the Browser, even if an Earlier Step Fails
    driver.quit()
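A single scroll followed by a fixed pause works when the page finishes loading within that time. As a hedged alternative, the sketch below keeps scrolling until the page height stops growing. The function name `scroll_until_stable` and its parameters `pause` and `max_rounds` are hypothetical; the only thing it assumes about `driver` is an `execute_script` method, which Selenium WebDriver objects provide.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing; return rounds used."""
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content a moment to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            # Nothing new was appended, so the bottom has been reached
            return round_number
        last_height = new_height
    # Gave up after max_rounds; the page may still be growing
    return max_rounds
```

It would be called as `scroll_until_stable(driver)` after `driver.get(...)` and before reading `driver.page_source`.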

In [3]:
# Creating BeautifulSoup Object (the "lxml" parser requires the lxml package to be installed)
soup = BeautifulSoup(html, "lxml")

In [6]:
# Heading of the Webpage
heading = soup.find("h1").text
print(heading)

Top 250 TV Shows
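`soup.find(...)` returns `None` when no matching tag exists, so calling `.text` on the result raises an `AttributeError` if the page layout changes. A minimal sketch of a guard, using a hypothetical helper named `first_text` and a one-line stand-in page:

```python
from bs4 import BeautifulSoup


def first_text(soup, name, default=None):
    """Return the stripped text of the first <name> tag, or default if absent."""
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag is not None else default


# Stand-in page for illustration only
page = BeautifulSoup(
    "<html><body><h1> Top 250 TV Shows </h1></body></html>", "html.parser"
)
print(first_text(page, "h1"))
print(first_text(page, "h2", default="(no h2 found)"))
```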


<img src="https://raw.githubusercontent.com/TheMrityunjayPathak/Data-Science-with-Python/main/Web%20Scrapping/images/heading.png" width="500px" border="2px solid black">

In [7]:
# TV Shows Name
lst1 = []
for i in soup.find_all("h3","ipc-title__text"):
    lst1.append(i.text)

print(lst1)

['IMDb Charts', '1. Breaking Bad', '2. Planet Earth II', '3. Planet Earth', '4. Band of Brothers', '5. Chernobyl', '6. The Wire', '7. Avatar: The Last Airbender', '8. Blue Planet II', '9. The Sopranos', '10. Cosmos: A Spacetime Odyssey', '11. Cosmos', '12. Our Planet', '13. Game of Thrones', '14. Bluey', '15. The World at War', '16. Fullmetal Alchemist: Brotherhood', '17. Rick and Morty', '18. Life', '19. The Last Dance', '20. The Twilight Zone', '21. The Vietnam War', '22. Sherlock', '23. Attack on Titan', '24. Batman: The Animated Series', '25. The Office', '26. The Blue Planet', '27. Better Call Saul', '28. Arcane', '29. Scam 1992: The Harshad Mehta Story', "30. Clarkson's Farm", '31. Human Planet', '32. Frozen Planet', '33. Firefly', '34. Hunter x Hunter', '35. Only Fools and Horses', '36. The Civil War', '37. Death Note', '38. Seinfeld', '39. Dekalog', '40. Gravity Falls', '41. The Beatles: Get Back', '42. True Detective', '43. Cowboy Bebop', '44. Fargo', '45. Persona', '46. Natha

<img src="https://raw.githubusercontent.com/TheMrityunjayPathak/Data-Science-with-Python/main/Web%20Scrapping/images/tv%20show%20name.png" width="500px" border="2px solid black">

In [8]:
# Dropping the extra <h3> heading ("IMDb Charts") and keeping the 250 ranked titles
lst1 = lst1[1:251]
print(lst1)

['1. Breaking Bad', '2. Planet Earth II', '3. Planet Earth', '4. Band of Brothers', '5. Chernobyl', '6. The Wire', '7. Avatar: The Last Airbender', '8. Blue Planet II', '9. The Sopranos', '10. Cosmos: A Spacetime Odyssey', '11. Cosmos', '12. Our Planet', '13. Game of Thrones', '14. Bluey', '15. The World at War', '16. Fullmetal Alchemist: Brotherhood', '17. Rick and Morty', '18. Life', '19. The Last Dance', '20. The Twilight Zone', '21. The Vietnam War', '22. Sherlock', '23. Attack on Titan', '24. Batman: The Animated Series', '25. The Office', '26. The Blue Planet', '27. Better Call Saul', '28. Arcane', '29. Scam 1992: The Harshad Mehta Story', "30. Clarkson's Farm", '31. Human Planet', '32. Frozen Planet', '33. Firefly', '34. Hunter x Hunter', '35. Only Fools and Horses', '36. The Civil War', '37. Death Note', '38. Seinfeld', '39. Dekalog', '40. Gravity Falls', '41. The Beatles: Get Back', '42. True Detective', '43. Cowboy Bebop', '44. Fargo', '45. Persona', '46. Nathan for You', '47

In [9]:
# Name of TV Show (split on the first "." only, so titles containing a period stay intact)
name_list = []
for i in lst1:
    name_list.append(i.split(".", 1)[1].strip())

print(name_list)

['Breaking Bad', 'Planet Earth II', 'Planet Earth', 'Band of Brothers', 'Chernobyl', 'The Wire', 'Avatar: The Last Airbender', 'Blue Planet II', 'The Sopranos', 'Cosmos: A Spacetime Odyssey', 'Cosmos', 'Our Planet', 'Game of Thrones', 'Bluey', 'The World at War', 'Fullmetal Alchemist: Brotherhood', 'Rick and Morty', 'Life', 'The Last Dance', 'The Twilight Zone', 'The Vietnam War', 'Sherlock', 'Attack on Titan', 'Batman: The Animated Series', 'The Office', 'The Blue Planet', 'Better Call Saul', 'Arcane', 'Scam 1992: The Harshad Mehta Story', "Clarkson's Farm", 'Human Planet', 'Frozen Planet', 'Firefly', 'Hunter x Hunter', 'Only Fools and Horses', 'The Civil War', 'Death Note', 'Seinfeld', 'Dekalog', 'Gravity Falls', 'The Beatles: Get Back', 'True Detective', 'Cowboy Bebop', 'Fargo', 'Persona', 'Nathan for You', 'As If', 'Taskmaster', 'Apocalypse: The Second World War', 'When They See Us', 'Last Week Tonight with John Oliver', 'Africa', 'Friends', 'Succession', "It's Always Sunny in Phil
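The fixed slice `lst1[1:251]` relies on there being exactly one extra heading ahead of the ranked titles. As a hedged alternative, the sketch below keeps only strings that match a "rank, dot, title" pattern and returns the rank alongside the name. The names `RANKED_TITLE` and `parse_ranked_titles` and the sample title "Mr. Example" are illustrative and not part of the original notebook.

```python
import re

# A rank, a dot, whitespace, then the title, e.g. "12. Our Planet"
RANKED_TITLE = re.compile(r"^(\d+)\.\s+(.*)$")


def parse_ranked_titles(headings):
    """Return (rank, name) pairs for headings that start with 'N. '."""
    parsed = []
    for text in headings:
        match = RANKED_TITLE.match(text.strip())
        if match:
            # Headings without a leading rank (section titles) are skipped
            parsed.append((int(match.group(1)), match.group(2).strip()))
    return parsed


sample = ["IMDb Charts", "1. Breaking Bad", "2. Planet Earth II", "3. Mr. Example"]
print(parse_ranked_titles(sample))
```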

In [10]:
# IMDb Rating
rating = []
for i in soup.find_all("span","ipc-rating-star--rating"):
    rating.append(i.text.strip())

print(rating)

['9.5', '9.5', '9.4', '9.4', '9.3', '9.3', '9.3', '9.3', '9.2', '9.2', '9.3', '9.3', '9.2', '9.4', '9.2', '9.1', '9.1', '9.1', '9.1', '9.1', '9.1', '9.1', '9.1', '9.0', '9.0', '9.0', '9.0', '9.0', '9.2', '9.0', '9.0', '9.0', '8.9', '9.0', '9.0', '9.0', '8.9', '8.9', '8.9', '8.9', '8.9', '8.9', '8.9', '8.9', '9.0', '8.9', '9.0', '9.0', '9.0', '8.8', '8.8', '8.9', '8.9', '8.8', '8.8', '9.1', '8.9', '8.8', '8.8', '8.8', '9.0', '8.9', '8.8', '8.8', '9.1', '8.8', '8.8', '8.8', '8.8', '8.8', '8.8', '8.8', '8.8', '8.8', '9.0', '9.1', '8.8', '8.7', '9.0', '8.8', '8.7', '8.8', '8.8', '9.1', '9.0', '8.8', '8.7', '8.7', '8.7', '8.8', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.7', '8.9', '8.7', '8.7', '8.9', '8.7', '8.7', '8.7', '8.7', '8.9', '8.7', '8.6', '9.2', '8.7', '8.7', '8.6', '8.6', '8.7', '8.6', '8.6', '8.6', '8.6', '8.6', '8.6', '8.7', '8.6', '8.6', '8.6', '9.0'

<img src="https://raw.githubusercontent.com/TheMrityunjayPathak/Data-Science-with-Python/main/Web%20Scrapping/images/ratings.png" width="500px" border="2px solid black">

In [19]:
# Votes for TV Shows
lst2 = []
for i in soup.find_all("span","ipc-rating-star--voteCount"):
    lst2.append(i.text.strip())

print(lst2)

['(2.2M)', '(160K)', '(222K)', '(537K)', '(886K)', '(384K)', '(382K)', '(48K)', '(485K)', '(130K)', '(45K)', '(53K)', '(2.3M)', '(31K)', '(30K)', '(203K)', '(615K)', '(43K)', '(156K)', '(95K)', '(28K)', '(1M)', '(538K)', '(120K)', '(730K)', '(43K)', '(668K)', '(283K)', '(159K)', '(63K)', '(28K)', '(33K)', '(284K)', '(142K)', '(57K)', '(19K)', '(390K)', '(356K)', '(28K)', '(141K)', '(29K)', '(667K)', '(142K)', '(428K)', '(48K)', '(39K)', '(24K)', '(21K)', '(15K)', '(140K)', '(97K)', '(17K)', '(1.1M)', '(284K)', '(255K)', '(73K)', '(87K)', '(79K)', '(150K)', '(34K)', '(252K)', '(22K)', '(194K)', '(101K)', '(97K)', '(93K)', '(155K)', '(57K)', '(151K)', '(33K)', '(218K)', '(89K)', '(478K)', '(73K)', '(96K)', '(24K)', '(20K)', '(650K)', '(85K)', '(78K)', '(410K)', '(73K)', '(361K)', '(13K)', '(38K)', '(667K)', '(152K)', '(188K)', '(108K)', '(76K)', '(555K)', '(456K)', '(209K)', '(56K)', '(89K)', '(74K)', '(227K)', '(178K)', '(440K)', '(215K)', '(708K)', '(264K)', '(526K)', '(327K)', '(67K)'

<img src="https://raw.githubusercontent.com/TheMrityunjayPathak/Data-Science-with-Python/main/Web%20Scrapping/images/votes.png" width="500px" border="2px solid black">

In [18]:
# Audience Votes
votes = []
for i in lst2:
    votes.append(i.replace("(","").replace(")","").strip())

print(votes)

['2.2M', '160K', '222K', '537K', '886K', '384K', '382K', '48K', '485K', '130K', '45K', '53K', '2.3M', '31K', '30K', '203K', '615K', '43K', '156K', '95K', '28K', '1M', '538K', '120K', '730K', '43K', '668K', '283K', '159K', '63K', '28K', '33K', '284K', '142K', '57K', '19K', '390K', '356K', '28K', '141K', '29K', '667K', '142K', '428K', '48K', '39K', '24K', '21K', '15K', '140K', '97K', '17K', '1.1M', '284K', '255K', '73K', '87K', '79K', '150K', '34K', '252K', '22K', '194K', '101K', '97K', '93K', '155K', '57K', '151K', '33K', '218K', '89K', '478K', '73K', '96K', '24K', '20K', '650K', '85K', '78K', '410K', '73K', '361K', '13K', '38K', '667K', '152K', '188K', '108K', '76K', '555K', '456K', '209K', '56K', '89K', '74K', '227K', '178K', '440K', '215K', '708K', '264K', '526K', '327K', '67K', '190K', '138K', '215K', '85K', '25K', '21K', '1.4M', '145K', '177K', '73K', '24K', '48K', '72K', '33K', '128K', '120K', '161K', '27K', '18K', '14K', '599K', '313K', '162K', '44K', '535K', '90K', '83K', '292K'
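The vote counts are still text such as `'2.2M'` or `'160K'`, which cannot be sorted or summed as numbers. A minimal sketch of converting them to integers follows; `parse_vote_count` and `SUFFIX_MULTIPLIERS` are hypothetical names, and only the `K` and `M` suffixes seen in the output above are handled. Because the page shows abbreviated counts, the results are approximations.

```python
# Multipliers for the suffixes seen in the scraped output ("K" and "M")
SUFFIX_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_vote_count(text):
    """Convert strings such as '(2.2M)', '160K' or '950' to an integer."""
    cleaned = text.strip().strip("()").replace(",", "").upper()
    if cleaned and cleaned[-1] in SUFFIX_MULTIPLIERS:
        # Scale the numeric part by the suffix and round away float noise
        return round(float(cleaned[:-1]) * SUFFIX_MULTIPLIERS[cleaned[-1]])
    return int(cleaned)


print([parse_vote_count(v) for v in ["(2.2M)", "160K", "1M", "48K"]])
```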

In [17]:
# Collecting all the Data into a DataFrame
tv_shows = pd.DataFrame({"Name":name_list,"Ratings":rating,"Votes":votes})
tv_shows.head(10)

|   | Name | Ratings | Votes |
|---|------|---------|-------|
| 0 | Breaking Bad | 9.5 | 2.2M |
| 1 | Planet Earth II | 9.5 | 160K |
| 2 | Planet Earth | 9.4 | 222K |
| 3 | Band of Brothers | 9.4 | 537K |
| 4 | Chernobyl | 9.3 | 886K |
| 5 | The Wire | 9.3 | 384K |
| 6 | Avatar: The Last Airbender | 9.3 | 382K |
| 7 | Blue Planet II | 9.3 | 48K |
| 8 | The Sopranos | 9.2 | 485K |
| 9 | Cosmos: A Spacetime Odyssey | 9.2 | 130K |
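Every column in `tv_shows` holds strings at this point, because the values came straight out of the HTML. A minimal sketch of converting `Ratings` to numbers so it can be filtered or sorted is shown here on a small hand-made sample rather than the scraped data; the variable names `sample` and `top` are illustrative.

```python
import pandas as pd

# Small hand-made sample in the same shape as the scraped DataFrame
sample = pd.DataFrame({
    "Name": ["Breaking Bad", "Planet Earth II", "Planet Earth"],
    "Ratings": ["9.5", "9.5", "9.4"],
    "Votes": ["2.2M", "160K", "222K"],
})

# Ratings arrive as text; to_numeric turns them into floats
sample["Ratings"] = pd.to_numeric(sample["Ratings"])

# With a numeric column, comparisons behave as expected
top = sample[sample["Ratings"] >= 9.5]

print(sample["Ratings"].dtype)
print(top["Name"].tolist())
```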


In [16]:
# Saving the DataFrame as CSV File
tv_shows.to_csv("tv_shows.csv",index=False)

---

<p style="color:#6a66bd; text-align:center; font-weight:bold; font-family:verdana; font-size:25px;">Thanks 👏 for Visiting!</p>