# Beautiful Soup Practical

In [1]:
!pip install bs4
!pip install requests



Observation: First we install 'Beautiful Soup', a Python package for parsing HTML and XML documents. It builds a parse tree from the page source that can be used to extract data, which makes it useful for web scraping.

# Import Required Libraries

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

Observation: After installing 'Beautiful Soup', we import 'BeautifulSoup' from 'bs4'. We also import 'requests' to fetch pages from web sites and 'pandas' to build dataframes from the scraped data.

# Q.1 Find the list of restaurants in Delhi

## Send a GET request to the web server to get the source code of the page

In [3]:
page = requests.get('https://www.dineout.co.in/delhi-restaurants/buffet-special')

Observation: Send a GET request for the page. The server's response, including the HTML, is stored in 'page'.

In [4]:
page

<Response [200]>

Observation: Check the response status. A status code of 200 means the request succeeded and the server returned the page content.
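The status check can also be done in code rather than by eye. The sketch below is one possible way to wrap it; the helper name `fetch_html`, the 10-second timeout and the User-Agent string are illustrative assumptions and do not come from the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as bytes, or raise if the request failed.

    The helper name, timeout and User-Agent value are assumptions made
    for this sketch.
    """
    headers = {"User-Agent": "bs4-practical-notebook"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes,
    # so a failed request is never handed on to the parser.
    response.raise_for_status()
    return response.content


# Example call (needs network access):
# html_bytes = fetch_html('https://www.dineout.co.in/delhi-restaurants/buffet-special')
```

With this in place, a failed request stops the notebook at the fetch step instead of producing empty lists later on.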

## Page content

In [5]:
soup = BeautifulSoup(page.content)
soup

<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/><link href="/manifest.json" rel="manifest"/><style type="text/css">
            @font-face {
                font-family: 'dineicon';
                src:  url('/fonts/dineicon.eot');
                src:  url('/fonts/dineicon.eot#iefix') format('embedded-opentype'),
                url('/fonts/dineicon.ttf') format('truetype'),
                url('/fonts/dineicon.woff') format('woff'),
                url('/fonts/dineicon.svg#dineicon') format('svg');
                font-weight: normal;
				font-style: normal;
				font-display: swap;
            }
            .hide {
                display: none !important;
            }
            .async-hide{
                opacity: inherit !important;
            }
            iframe[name="google_conversion_frame"]{
        

Observation: Here, we parse the page content and display it. The raw HTML is hard to read, so we need to extract only the parts we need.

## Scraping the first title using the 'find' method

In [6]:
# First, we use the HTML tag and class that hold the first restaurant title.
# Here, we use the 'find' method, which returns only the first matching element:

first_title = soup.find('div', class_ = "restnt-info cursor")
first_title

<div class="restnt-info cursor" data-gatype="RestaurantNameClick"><a analytics-action="RestaurantCardClick" analytics-label="86792_Castle Barbeque" class="restnt-name ellipsis" data-w-onclick="sendAnalyticsCommon|w1-restarant" href="/delhi/castle-barbeque-connaught-place-central-delhi-86792">Castle Barbeque</a><div class="restnt-loc ellipsis" data-w-onclick="stopClickPropagation|w1-restarant"><a data-name="Connaught Place" data-type="LocalityClick" href="/delhi-restaurants/central-delhi/connaught-place">Connaught Place</a>, <a data-name="Central Delhi" data-type="AreaClick" href="/delhi-restaurants/central-delhi">Central Delhi</a></div></div>

In [7]:
first_title.text

'Castle BarbequeConnaught Place, Central Delhi'

## Scraping first location

In [8]:
loc = soup.find('div', class_ = "restnt-loc ellipsis")
loc.text

'Connaught Place, Central Delhi'

## Scraping first price

In [9]:
price = soup.find('span', class_ = "double-line-ellipsis")
price.text

'₹ 2,000 for 2 (approx) | Chinese, North Indian'

In [10]:
price.text.replace('₹', '')

' 2,000 for 2 (approx) | Chinese, North Indian'

In [11]:
price.text.split()

['₹', '2,000', 'for', '2', '(approx)', '|', 'Chinese,', 'North', 'Indian']

In [12]:
price.text.split()[1]

'2,000'
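The `replace` and `split` steps above can be combined into a single helper that returns the cost as a number and the cuisines as a list. This is a minimal sketch; the function name `parse_cost_for_two` is hypothetical, and it assumes the text always follows the 'cost | cuisines' layout seen in the output above.

```python
import re


def parse_cost_for_two(text):
    """Split a string such as '₹ 2,000 for 2 (approx) | Chinese, North Indian'.

    Returns (cost_as_int, list_of_cuisines). The function name is hypothetical.
    """
    cost_part, _, cuisine_part = text.partition('|')
    # The first run of digits and commas is the amount, e.g. '2,000'.
    match = re.search(r'\d[\d,]*', cost_part)
    cost = int(match.group().replace(',', '')) if match else None
    cuisines = [c.strip() for c in cuisine_part.split(',') if c.strip()]
    return cost, cuisines


cost, cuisines = parse_cost_for_two('₹ 2,000 for 2 (approx) | Chinese, North Indian')
print(cost, cuisines)  # 2000 ['Chinese', 'North Indian']
```

Storing the cost as an integer makes it possible to sort or filter the restaurants by price later.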

## Scraping Multiple Titles by Using 'find_all' method:

In [13]:
# Creating an empty list:
titles = []

# Using for loop to fetch all titles and append into titles list:
for i in soup.find_all('div', class_ = "restnt-info cursor"):
    titles.append(i.text)
    
titles

['Castle BarbequeConnaught Place, Central Delhi',
 'Jungle Jamboree3CS Mall,Lajpat Nagar - 3, South Delhi',
 'Cafe KnoshThe Leela Ambience Convention Hotel,Shahdara, East Delhi',
 'Castle BarbequePacific Mall,Tagore Garden, West Delhi',
 'The Barbeque CompanyGardens Galleria,Sector 38A, Noida',
 'India GrillHilton Garden Inn,Saket, South Delhi',
 'Delhi BarbequeTaurus Sarovar Portico,Mahipalpur, South Delhi',
 'The Monarch - Bar Be Que VillageIndirapuram Habitat Centre,Indirapuram, Ghaziabad',
 'Indian Grill RoomSuncity Business Tower,Golf Course Road, Gurgaon']

## Scraping Multiple Locations

In [14]:
location = []

for i in soup.find_all('div', class_ = "restnt-loc ellipsis"):
    location.append(i.text)
    
location

['Connaught Place, Central Delhi',
 '3CS Mall,Lajpat Nagar - 3, South Delhi',
 'The Leela Ambience Convention Hotel,Shahdara, East Delhi',
 'Pacific Mall,Tagore Garden, West Delhi',
 'Gardens Galleria,Sector 38A, Noida',
 'Hilton Garden Inn,Saket, South Delhi',
 'Taurus Sarovar Portico,Mahipalpur, South Delhi',
 'Indirapuram Habitat Centre,Indirapuram, Ghaziabad',
 'Suncity Business Tower,Golf Course Road, Gurgaon']

## Scraping Multiple Prices

In [15]:
price = []

for i in soup.find_all('span', class_ = "double-line-ellipsis"):
    price.append(i.text.split('|')[0])

    
price

['₹ 2,000 for 2 (approx) ',
 '₹ 1,680 for 2 (approx) ',
 '₹ 3,000 for 2 (approx) ',
 '₹ 2,000 for 2 (approx) ',
 '₹ 1,700 for 2 (approx) ',
 '₹ 2,400 for 2 (approx) ',
 '₹ 1,800 for 2 (approx) ',
 '₹ 1,900 for 2 (approx) ',
 '₹ 2,200 for 2 (approx) ']

## Scraping Cuisines from different restaurants

In [16]:
cuisine = []
for i in soup.find_all('span', class_ = "double-line-ellipsis"):
    cuisine.append(i.text.split('|')[1])
    
cuisine

[' Chinese, North Indian',
 ' North Indian, Asian, Italian',
 ' Italian, Continental',
 ' Chinese, North Indian',
 ' North Indian, Chinese',
 ' North Indian, Italian',
 ' North Indian',
 ' North Indian',
 ' North Indian, Mughlai']

## Scraping Ratings

In [17]:
rating = []

# Match on the shared 'restnt-rating' class in a single pass so the ratings stay in
# page order; separate passes for 'rating-3' and 'rating-4' would group them by class.
for i in soup.find_all('div', class_ = "restnt-rating"):
    rating.append(i.text)
    
rating

['4.1', '3.9', '4.3', '3.9', '4', '3.9', '3.6', '3.8', '4.3']

## Scraping Multiple Images

In [18]:
images = []

for i in soup.find_all('img', class_ = "no-img"):
    images.append(i['data-src'])
    
images

['https://im1.dineout.co.in/images/uploads/restaurant/sharpen/8/k/b/p86792-16062953735fbe1f4d3fb7e.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/5/p/m/p59633-166088382462ff137009010.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/p/m/p406-15438184745c04ccea491bc.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/3/j/o/p38113-15959192065f1fcb666130c.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/7/p/k/p79307-16051787755fad1597f2bf9.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/2/v/t/p2687-1482477169585cce712b90f.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/5/d/i/p52501-1661855212630de5eceb6d2.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/3/n/o/p34822-15599107305cfa594a13c24.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/

In [19]:
# Check the length:
print(len(titles), len(location), len(price), len(cuisine), len(rating), len(images))

9 9 9 9 9 9
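Equal list lengths do not by themselves guarantee that the n-th title, location and rating belong to the same restaurant, and the title text above also runs the name and location together ('Castle BarbequeConnaught Place, ...'). One way to keep fields aligned is to loop over each restaurant card and read every field inside it. The sketch below uses a small stand-in HTML string; the inner class names come from the cells above, while the 'restnt-card' wrapper class and the helper `text_or_none` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page. The 'restnt-card' wrapper is an assumed name;
# the inner class names are the ones used in the cells above.
html = """
<div class="restnt-card">
  <div class="restnt-rating rating-4">4.1</div>
  <div class="restnt-info cursor">
    <a class="restnt-name ellipsis">Castle Barbeque</a>
    <div class="restnt-loc ellipsis">Connaught Place, Central Delhi</div>
  </div>
</div>
<div class="restnt-card">
  <div class="restnt-info cursor">
    <a class="restnt-name ellipsis">Jungle Jamboree</a>
    <div class="restnt-loc ellipsis">3CS Mall,Lajpat Nagar - 3, South Delhi</div>
  </div>
</div>
"""

soup_demo = BeautifulSoup(html, 'html.parser')


def text_or_none(parent, name, css_class):
    """Return the stripped text of the first match inside `parent`, or None."""
    tag = parent.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag else None


rows = []
for card in soup_demo.find_all('div', class_='restnt-card'):
    # Every field is looked up inside the same card, so a missing rating
    # becomes None instead of shifting the later ratings up by one row.
    rows.append({
        'Title': text_or_none(card, 'a', 'restnt-name'),
        'Location': text_or_none(card, 'div', 'restnt-loc'),
        'Rating': text_or_none(card, 'div', 'restnt-rating'),
    })

print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.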


## Making the dataframe

In [20]:
import pandas as pd
df = pd.DataFrame({'Titles':titles, 'Location':location, 'Price':price, 'Cuisine':cuisine, 'Ratings':rating, 'Images_URL':images})
df

Unnamed: 0,Titles,Location,Price,Cuisine,Ratings,Images_URL
0,"Castle BarbequeConnaught Place, Central Delhi","Connaught Place, Central Delhi","₹ 2,000 for 2 (approx)","Chinese, North Indian",4.1,https://im1.dineout.co.in/images/uploads/resta...
1,"Jungle Jamboree3CS Mall,Lajpat Nagar - 3, Sout...","3CS Mall,Lajpat Nagar - 3, South Delhi","₹ 1,680 for 2 (approx)","North Indian, Asian, Italian",3.9,https://im1.dineout.co.in/images/uploads/resta...
2,"Cafe KnoshThe Leela Ambience Convention Hotel,...","The Leela Ambience Convention Hotel,Shahdara, ...","₹ 3,000 for 2 (approx)","Italian, Continental",4.3,https://im1.dineout.co.in/images/uploads/resta...
3,"Castle BarbequePacific Mall,Tagore Garden, Wes...","Pacific Mall,Tagore Garden, West Delhi","₹ 2,000 for 2 (approx)","Chinese, North Indian",3.9,https://im1.dineout.co.in/images/uploads/resta...
4,"The Barbeque CompanyGardens Galleria,Sector 38...","Gardens Galleria,Sector 38A, Noida","₹ 1,700 for 2 (approx)","North Indian, Chinese",4.0,https://im1.dineout.co.in/images/uploads/resta...
5,"India GrillHilton Garden Inn,Saket, South Delhi","Hilton Garden Inn,Saket, South Delhi","₹ 2,400 for 2 (approx)","North Indian, Italian",3.9,https://im1.dineout.co.in/images/uploads/resta...
6,"Delhi BarbequeTaurus Sarovar Portico,Mahipalpu...","Taurus Sarovar Portico,Mahipalpur, South Delhi","₹ 1,800 for 2 (approx)",North Indian,3.6,https://im1.dineout.co.in/images/uploads/resta...
7,The Monarch - Bar Be Que VillageIndirapuram Ha...,"Indirapuram Habitat Centre,Indirapuram, Ghaziabad","₹ 1,900 for 2 (approx)",North Indian,3.8,https://im1.dineout.co.in/images/uploads/resta...
8,"Indian Grill RoomSuncity Business Tower,Golf C...","Suncity Business Tower,Golf Course Road, Gurgaon","₹ 2,200 for 2 (approx)","North Indian, Mughlai",4.3,https://im1.dineout.co.in/images/uploads/resta...


Observation: The fetched data is now arranged in a table and is much easier to read.

# Q.2 Find all the header tags from 'wikipedia.org'

Heading tags = HTML provides six levels of heading elements, h1 to h6, where h1 is the most important and h6 the least.
They are different from the header element, which is a container for introductory content or a set of navigation links.

In [21]:
# Import Libraries:
import pandas as pd
from bs4 import BeautifulSoup
import requests

Observation: Importing essential libraries.

In [22]:
# send requests
page2 = requests.get("https://en.wikipedia.org/wiki/Main_Page")

# page content
soup = BeautifulSoup(page2.content)

# fetching header from page
header_tags = []

for i in soup.find_all(['h1','h2','h3','h4','h5','h6']):
    header_tags.append(i.name+" "+i.text.strip())
    
# print all header_tags
header_tags

['h1 Main Page',
 'h1 Welcome to Wikipedia',
 "h2 From today's featured article",
 'h2 Did you know\xa0...',
 'h2 In the news',
 'h2 On this day',
 "h2 Today's featured picture",
 'h2 Other areas of Wikipedia',
 "h2 Wikipedia's sister projects",
 'h2 Wikipedia languages',
 'h2 Navigation menu',
 'h3 Personal tools',
 'h3 Namespaces',
 'h3 Views',
 'h3 Search',
 'h3 Navigation',
 'h3 Contribute',
 'h3 Tools',
 'h3 Print/export',
 'h3 In other projects',
 'h3 Languages']

Observation: Here, we fetch the heading tags from wikipedia.org. Headings are the titles and subtitles that structure a webpage. There are six levels: h1, h2, h3, h4, h5, and h6.
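As an alternative to listing all six tag names, `find_all` also accepts a compiled regular expression for the tag name. The sketch below runs on a short stand-in HTML string rather than the live page, and it also replaces the non-breaking space that shows up as '\xa0' in the output above.

```python
import re
from bs4 import BeautifulSoup

# Stand-in HTML; the real page is fetched with requests in the cell above.
html = """
<h1>Main Page</h1>
<h2>Did you know&nbsp;...</h2>
<h3>Search</h3>
<p>Not a heading</p>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

headings = []
# A compiled pattern as the tag name matches h1 to h6 without listing each one.
for tag in soup_demo.find_all(re.compile(r'^h[1-6]$')):
    # Replace non-breaking spaces so the text prints cleanly.
    text = tag.get_text().replace('\xa0', ' ').strip()
    headings.append((tag.name, text))

print(headings)
```

Keeping the level and the text as separate values makes it easy to put them into two dataframe columns.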

## Create a dataframe

In [23]:
# create data frame:
df = pd.DataFrame(header_tags, columns = ['Header'])
df

Unnamed: 0,Header
0,h1 Main Page
1,h1 Welcome to Wikipedia
2,h2 From today's featured article
3,h2 Did you know ...
4,h2 In the news
5,h2 On this day
6,h2 Today's featured picture
7,h2 Other areas of Wikipedia
8,h2 Wikipedia's sister projects
9,h2 Wikipedia languages


Observation: Now, our fetched data looks more informative.

# Q3. Find all the header tags from TOI (Times of India)

In [24]:
page3 = requests.get("https://timesofindia.indiatimes.com/business/india-business/isros-lvm3-to-make-commercial-foray-with-launch-of-36-oneweb-satellites-on-october-23/articleshow/94873095.cms")

soup = BeautifulSoup(page3.content)

header_tags = []

for i in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
    header_tags.append(i.name+" "+i.text.strip())
    
header_tags

['h3 Top Searches',
 'h2 TOI',
 "h1 Isro's LVM3 to make commercial foray with launch of 36 OneWeb satellites on October 23",
 'h4 ARTICLES',
 'h2 Visual Stories',
 'h2 Trending Stories',
 'h2 ',
 'h4 FOLLOW US ON',
 'h4 Other Times Group News Sites',
 'h4 Popular Categories',
 'h4 Hot on the Web',
 'h4 Top Trends',
 'h4 Trending Topics',
 'h4 Living and entertainment',
 'h4 Services']
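The output above contains an entry 'h2 ' with no text. A small post-processing step can drop such entries and split each string into a level and a title. This is a sketch on a shortened copy of the list; the column names 'Level' and 'Header' are illustrative choices.

```python
# Shortened copy of the list printed above.
header_tags = ['h3 Top Searches', 'h2 TOI', 'h2 ', 'h4 FOLLOW US ON']

rows = []
for entry in header_tags:
    # Split at the first space: the tag name comes first, the heading text after it.
    level, _, text = entry.partition(' ')
    text = text.strip()
    if text:  # skip headings that have no visible text
        rows.append({'Level': level, 'Header': text})

print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` to get one column per field.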

## Create Data Frame:

In [25]:
# create data frame:
df = pd.DataFrame(header_tags, columns = ['Header'])
df

Unnamed: 0,Header
0,h3 Top Searches
1,h2 TOI
2,h1 Isro's LVM3 to make commercial foray with l...
3,h4 ARTICLES
4,h2 Visual Stories
5,h2 Trending Stories
6,h2
7,h4 FOLLOW US ON
8,h4 Other Times Group News Sites
9,h4 Popular Categories


# Q4. Find all the header tags from "idronline.org" and create DataFrame

In [26]:
# send request to the webpage server to get the source code of the page:
page4 = requests.get("https://idronline.org/whats-stopping-india-from-achieving-the-growth-it-wants/?gclid=Cj0KCQjwteOaBhDuARIsADBqRehtZbzSF4-85tFMxyg54DvWIA_pT1zJkSsAIFZG_LFUF4C3uNQhy4kaAso6EALw_wcB")

# page content:
soup = BeautifulSoup(page4.content)

# create empty list:
header_tags = []

for i in soup.find_all(["h1","h2","h3","h4","h5","h6"]):
    header_tags.append(i.name+" "+i.text.strip())

# print all header_tags
header_tags

['h1 What’s stopping India from achieving the growth it wants',
 "h2 India won't be able to improve it's economy or develop sustainably unless it reduces gender and income inequality.",
 'h3 Gender imbalance',
 'h3 Widespread inequality',
 'h3 Why it is becoming increasingly important to invest in people',
 'h5 ABOUT',
 'h5 SECTORS',
 'h5 EXPERTISE',
 'h5 THEMES',
 'h5 FOLLOW US',
 'h2 Before you go...',
 "h3 Make sure you're always up to date with the latest thinking on social impact"]

## Create a dataframe:

In [27]:
# create data frame:
df = pd.DataFrame(header_tags, columns = ['Header'])
df

Unnamed: 0,Header
0,h1 What’s stopping India from achieving the gr...
1,h2 India won't be able to improve it's economy...
2,h3 Gender imbalance
3,h3 Widespread inequality
4,h3 Why it is becoming increasingly important t...
5,h5 ABOUT
6,h5 SECTORS
7,h5 EXPERTISE
8,h5 THEMES
9,h5 FOLLOW US


# Q5. Find all the header tags from "britannica.com" and create a dataframe

In [28]:
# send get request to the webpage server to get the source code of the page:
page5 = requests.get("https://www.britannica.com/science/helium-chemical-element")

# page content:
soup = BeautifulSoup(page5.content)

# create empty list:
header_tags = []

for i in soup.find_all(["h1","h2","h3","h4","h5","h6"]):
    header_tags.append(i.name+" "+i.text.strip())
    
# print all header_tags:
header_tags

['h1 helium',
 'h3 Read a brief summary of this topic',
 'h2 History',
 'h2 Abundance and isotopes',
 'h2 Properties',
 'h2 Production and uses']

## Create a dataframe

In [29]:
# create data frame:
df = pd.DataFrame(header_tags, columns = ['Header'])
df

Unnamed: 0,Header
0,h1 helium
1,h3 Read a brief summary of this topic
2,h2 History
3,h2 Abundance and isotopes
4,h2 Properties
5,h2 Production and uses


# Q6. Fetch the top rated 100 movies from the IMDB website and make a dataframe

In [30]:
# Send the request to the webpage to get the source code of the page
url = "https://www.imdb.com/list/ls091520106/"
page6 = requests.get(url)

# See page content
soup = BeautifulSoup(page6.content)

# Top Movies name
name = soup.find_all('h3', class_='lister-item-header')

# get text from movie name web elements
# empty list
movies_name = []

for i in name:
    for j in i.find_all("a"):
        movies_name.append(j.text)
        

## Fetching year of release and rating

In [31]:
# Year of release
year = soup.find_all("span", class_="lister-item-year text-muted unbold")

# empty list
year_of_release = []
for l in year:
    a=l.text.replace('(', '')
    year_of_release.append(a.replace(')', ''))
    
# Fetching ratings of movies
rating = soup.find_all("div",class_ = "ipl-rating-star small")

# Scrape text from rating web element
# empty list 
IMDB_rating = []

for i in rating:
    IMDB_rating.append(float(i.text))

## Create dataframe 

In [32]:
IMDB_top_100 = pd.DataFrame({})
IMDB_top_100['Movies_name']=movies_name
IMDB_top_100['Year_of_release']=year_of_release
IMDB_top_100['IMDB_rating']=IMDB_rating
IMDB_top_100

Unnamed: 0,Movies_name,Year_of_release,IMDB_rating
0,The Shawshank Redemption,1994,9.3
1,The Godfather,1972,9.2
2,The Godfather Part II,1974,9.0
3,The Dark Knight,2008,9.0
4,12 Angry Men,1957,9.0
...,...,...,...
95,North by Northwest,1959,8.3
96,A Clockwork Orange,1971,8.3
97,Snatch,2000,8.2
98,Le fabuleux destin d'Amélie Poulain,2001,8.3


# Q7. Fetch Top rated 100 Indian movies from IMDB website and make a dataframe 

In [33]:
# Send request to the webpage server to get the source code of the page
url = "https://www.imdb.com/india/top-rated-indian-movies/"
page7 = requests.get(url)

# Check the page content
soup = BeautifulSoup(page7.content)

In [34]:
# Top Movies name
name = soup.find_all("td",class_="titleColumn")

# get text from movie name web elements
# create empty list
movies_name = []
for i in name:
    for t in i.find_all("a"):
        movies_name.append(t.text)

In [35]:
# Year of release
year = soup.find_all("span", class_ = "secondaryInfo")

# Create empty list
year_of_release = []
for n in year:
    a = n.text.replace('(', '')
    year_of_release.append(a.replace(')', ''))

In [36]:
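# IMDB rating
rating_cells = soup.find_all("td", class_ = "ratingColumn imdbRating")

# Create empty list and append rating information
# (iterate over this page's rating cells, not the 'rating' list left over from Q6)
IMDB_rating = []
for i in rating_cells:
    IMDB_rating.append(float(i.text))

Observation: Creating an empty list and storing the movie ratings in that list.

In [37]:
print(len(movies_name),len(year_of_release),len(IMDB_rating))

250 250 100


Observation: Checking the length of each list. The page lists 250 movies, so we keep the first 100 entries of each list below. The 100 in the recorded output comes from a run in which the loop iterated over the 'rating' list left over from Q6 instead of this page's rating cells, which is also why the ratings in the tables below repeat the Q6 values.

## DataFrame

In [38]:
# Create dataframe
indian_top_100=pd.DataFrame({})
indian_top_100['Movies_name']=movies_name[:100]
indian_top_100['Year_of_release']=year_of_release[:100]
indian_top_100['IMDB_rating']=IMDB_rating[:100]
indian_top_100 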

Unnamed: 0,Movies_name,Year_of_release,IMDB_rating
0,Kantara,2022,9.3
1,Ramayana: The Legend of Prince Rama,1993,9.2
2,Rocketry: The Nambi Effect,2022,9.0
3,Nayakan,1987,9.0
4,Anbe Sivam,2003,9.0
...,...,...,...
95,Ustad Hotel,2012,8.3
96,Theeran Adhigaaram Ondru,2017,8.3
97,Rang De Basanti,2006,8.2
98,Baahubali 2: The Conclusion,2017,8.3


Observation: The data is now in tabular form and easier to read.

In [39]:
indian_top_100.head(20)

Unnamed: 0,Movies_name,Year_of_release,IMDB_rating
0,Kantara,2022,9.3
1,Ramayana: The Legend of Prince Rama,1993,9.2
2,Rocketry: The Nambi Effect,2022,9.0
3,Nayakan,1987,9.0
4,Anbe Sivam,2003,9.0
5,Golmaal,1979,9.0
6,777 Charlie,2022,9.0
7,Jai Bhim,2021,8.9
8,Pariyerum Perumal,2018,8.8
9,3 Idiots,2009,8.8


In [40]:
indian_top_100.tail(20)

Unnamed: 0,Movies_name,Year_of_release,IMDB_rating
80,Queen,2013,8.3
81,Mandela,2021,8.3
82,Article 15,2019,8.3
83,Talvar,2015,8.4
84,Hera Pheri,2000,8.3
85,PK,2014,8.3
86,Soodhu Kavvum,2013,8.3
87,OMG: Oh My God!,2012,8.3
88,Sarfarosh,1999,8.3
89,Sholay,1975,8.3


# Q8. Fetch Top rated 100 South Indian movies from IMDB website and make a dataframe 

In [41]:
# Send get request to the webpage server to get the source code of the page
url = "https://www.imdb.com/list/ls570577287/"
page8 = requests.get(url)

# Check the page content
soup = BeautifulSoup(page8.content)

Observation: Use the desired URL and send a request to get the page contents.

In [42]:
# Top Movies Name
name = soup.find_all("h3", class_ = "lister-item-header")

# Creating an empty list and store data into that list
movies_name = []
for i in name:
    for r in i.find_all("a"):
        movies_name.append(r.text)

Observation: Fetch the movie names and store them in the movies_name list.

In [43]:
# Year of release
year = soup.find_all('span', class_ = "lister-item-year text-muted unbold")

# Creating an empty list and store data into that list
year_of_release = []
for t in year:
    a = t.text.replace('(','')
    year_of_release.append(a.replace(')', ''))

Observation: Fetch the year of release, strip the parentheses, and store the result in a list.

In [44]:
# Movies rating
rating = soup.find_all("div", class_ = "ipl-rating-star small")

# Creating rating empty list and store data into that list
IMDB_rating = []
for i in rating:
    IMDB_rating.append(float(i.text))

Observation: Fetch the IMDB ratings of the movies and store them in a list.

In [45]:
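# Director
director = soup.find_all("p", class_ = "text-muted text-small")

# Creating an empty list and storing the text of every link in those paragraphs.
# Note: the paragraphs link to cast members as well as directors, so 'Dir' ends up
# with several names per movie rather than one director per movie.
Dir = []
for i in director:
    for k in i.find_all("a"):
        Dir.append(k.text)

Observation: Fetch the names linked in each 'text-muted text-small' paragraph and store them in a list. The list mixes directors with cast members.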

In [46]:
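# Votes
Votes = soup.find_all("span", class_ = "text-muted")

# Creating an empty list and storing the text of every matching span.
# Note: 'text-muted' is used on many spans, so this also picks up list metadata
# and labels, not only vote counts.
votes = []
for j in Votes:
    votes.append(j.text)

Observation: Fetch the text of every 'text-muted' span and store it in a list. The selector is broad, so the list contains more than vote counts.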

In [47]:
print(len(movies_name),len(year_of_release),len(IMDB_rating),len(Dir),len(votes))

100 100 100 504 332


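Observation: Checking the length of all the lists fetched above. 'Dir' and 'votes' have far more than 100 entries because their selectors match several elements per movie.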
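The lengths printed above (504 and 332 against 100 movies) suggest that the director and votes selectors match more than one element per movie. One possible remedy is to loop over each movie's block and pick the fields inside it. The sketch below runs on a stand-in HTML string: the 'lister-item-header' and 'text-muted text-small' classes come from the cells above, while the 'lister-item-content' wrapper, the 'Director:' label, the `name="nv"` attribute on the vote count and the vote value itself are assumptions or placeholders that would need checking against the real page.

```python
from bs4 import BeautifulSoup

# Stand-in for one list entry. The wrapper class, the 'Director:' label and the
# name="nv" attribute are assumptions; the vote count is a placeholder value.
html = """
<div class="lister-item-content">
  <h3 class="lister-item-header"><a>Jai Bhim</a></h3>
  <p class="text-muted text-small">
    Director: <a>T.J. Gnanavel</a> | Stars: <a>Suriya</a>, <a>Lijo Mol Jose</a>
  </p>
  <p class="text-muted text-small">
    <span class="text-muted">Votes:</span> <span name="nv">1,234</span>
  </p>
</div>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

rows = []
for item in soup_demo.find_all('div', class_='lister-item-content'):
    title = item.find('h3', class_='lister-item-header').a.get_text(strip=True)

    director = None
    for p in item.find_all('p', class_='text-muted text-small'):
        if 'Director' in p.get_text():
            # The first link in this paragraph is the director; later links are cast.
            link = p.find('a')
            director = link.get_text(strip=True) if link else None
            break

    # 'name' is also the tag-name argument of find(), so the HTML attribute
    # has to be passed through the attrs dictionary.
    votes_tag = item.find('span', attrs={'name': 'nv'})
    votes = int(votes_tag.get_text(strip=True).replace(',', '')) if votes_tag else None

    rows.append({'Movie': title, 'Director': director, 'Votes': votes})

print(rows)
```

Because each row is built from a single block, the director and votes stay attached to the right movie even when a field is missing.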

## Create a Dataframe

In [48]:
south_indian_top_100 = pd.DataFrame({})
south_indian_top_100['South Indian Movies'] = movies_name
south_indian_top_100['Release Year'] = year_of_release
south_indian_top_100['IMDB Rating'] = IMDB_rating
south_indian_top_100['Director'] = Dir[:100]
south_indian_top_100['Votes'] = votes[:100]
south_indian_top_100.head(20)

Unnamed: 0,South Indian Movies,Release Year,IMDB Rating,Director,Votes
0,K.G.F: Chapter 1,2018,8.2,Prashanth Neel,created - 11 months ago
1,Jai Bhim,2021,8.9,Yash,updated - 11 months ago
2,Bãhubali: The Beginning,2015,8.0,Srinidhi Shetty,\n Public\n
3,Baahubali 2: The Conclusion,2017,8.2,Ramachandra Raju,"See titles to watch instantly, titles you have..."
4,Asuran,2019,8.5,Archana Jois,(59)
5,Vada Chennai,2018,8.4,T.J. Gnanavel,(28)
6,Sarileru Neekevvaru,2020,5.8,Suriya,(140)
7,Ghajini,2005,7.5,Lijo Mol Jose,(116)
8,Tughlaq Durbar,2021,5.4,Manikandan K.,(85)
9,Master,2021,7.3,Rajisha Vijayan,(51)


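Observation: The dataframe shows the data in tabular form. Here, we use the 'head' method to request the first 20 rows. Because 'Dir' and 'votes' were simply cut to their first 100 entries, the 'Director' and 'Votes' columns do not line up with the movie in each row.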

In [49]:
df = south_indian_top_100
df.tail(20)

Unnamed: 0,South Indian Movies,Release Year,IMDB Rating,Director,Votes
80,Singam,2010,6.9,Jeethu Joseph,(6)
81,Singam 3,2017,6.0,Mohanlal,(6)
82,Singam 2,2013,6.3,Meena,(6)
83,Maattrraan,2012,6.1,Ansiba,(6)
84,NGK,2019,5.8,Esther Anil,(6)
85,Ponmagal Vandhal,2020,6.7,Manu Ashokan,(6)
86,Kaappaan,2019,6.2,Tovino Thomas,(6)
87,Rakhta Charitra,2010,7.6,Samyuktha Menon,(6)
88,Anjaan,2014,5.1,Parvathy Thiruvothu,(6)
89,Pithamagan,2003,8.3,Asif Ali,(6)


Observation: Here, we use the 'tail' method to request the last 20 rows.

# Q9. Fetch Top rated 100 American movies from IMDB website and make a dataframe 

In [50]:
# send get requests to the webpage server to get the source code of the page
url = 'https://www.imdb.com/list/ls002981281/'
page9 = requests.get(url)

# check the page content
soup = BeautifulSoup(page9.content)

In [51]:
# Top Movies Name
name = soup.find_all("h3", class_ = "lister-item-header")

# get text from movie name web elements
# create empty list to store movies name
movies_name = []
for i in name:
    for u in i.find_all("a"):
        movies_name.append(u.text)

In [52]:
# Year of release
year = soup.find_all('span', class_ = "lister-item-year text-muted unbold")

# create empty list to store year of release
year_of_release = []
for t in year:
    p=t.text.replace('(','')
    year_of_release.append(p.replace(')',''))

In [53]:
# Movies ratings
IMDB_rating = soup.find_all('div', class_ = "ipl-rating-star small")

# create empty list to store movie ratings
ratings = []
for i in IMDB_rating:
    ratings.append(float(i.text))

In [54]:
# Movie time
time = soup.find_all('span', class_ = "runtime")

# create empty list to store movie time
movie_time = []
for p in time:
    movie_time.append(p.text)

In [55]:
# Director
director = soup.find_all("p", class_ = "text-muted text-small")

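# create empty list to store the linked names
# Note: these paragraphs link to cast members as well as directors, so 'dire'
# holds several names per movie and dire[:100] is not one director per movie
dire = []
for i in director:
    for k in i.find_all("a"):
        dire.append(k.text)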

In [56]:
print(len(movies_name),len(year_of_release),len(ratings),len(movie_time),len(dire))

100 100 100 100 511
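A length mismatch like the one printed above (511 names against 100 movies) is easy to overlook once the list is sliced. A small guard can make it explicit before the dataframe is built. This is a sketch; the helper name `check_equal_lengths` is hypothetical, and the lists are stand-ins with the lengths shown above.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Stand-in lists with the lengths printed above.
try:
    check_equal_lengths(movies_name=[''] * 100, dire=[''] * 511)
    mismatch = None
except ValueError as err:
    mismatch = str(err)

print(mismatch)
```

Calling the guard before creating the dataframe turns a silent misalignment into an error that names the offending column.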


## Create a DataFrame

In [57]:
American_Top_100_movies = pd.DataFrame({})
American_Top_100_movies['Movie'] = movies_name
American_Top_100_movies['Release'] = year_of_release
American_Top_100_movies['Ratings'] = ratings
American_Top_100_movies['Duration'] = movie_time
American_Top_100_movies['Director'] = dire[:100]
American_Top_100_movies

Unnamed: 0,Movie,Release,Ratings,Duration,Director
0,The Godfather,1972,9.2,175 min,Francis Ford Coppola
1,The Godfather Part II,1974,9.0,202 min,Marlon Brando
2,One Flew Over the Cuckoo's Nest,1975,8.7,133 min,Al Pacino
3,Psycho,1960,8.5,109 min,James Caan
4,The Birds,1963,7.6,119 min,Diane Keaton
...,...,...,...,...,...
95,Inherit the Wind,1960,8.1,128 min,Marlon Brando
96,Airplane!,1980,7.7,88 min,Kim Hunter
97,Yellow Sky,1948,7.4,98 min,Karl Malden
98,Shock Corridor,1963,7.3,101 min,Mike Nichols


In [58]:
American_Top_100_movies.sample(10)

Unnamed: 0,Movie,Release,Ratings,Duration,Director
18,A Streetcar Named Desire,1951,7.9,122 min,Vera Miles
46,Singin' in the Rain,1952,8.3,103 min,Michael Curtiz
13,Rear Window,1954,8.5,112 min,Michael Berryman
69,Chinatown,1974,8.2,130 min,Wendell Corey
22,Vertigo,1958,8.3,128 min,Tippi Hedren
96,Airplane!,1980,7.7,88 min,Kim Hunter
57,How Green Was My Valley,1941,7.7,118 min,Morgan Freeman
92,Cat on a Hot Tin Roof,1958,7.9,108 min,Barbara O'Neil
37,The Maltese Falcon,1941,8.0,100 min,Joseph Cotten
66,The Little Foxes,1941,7.9,116 min,Alfred Hitchcock


In [59]:
American_Top_100_movies.head(20)

Unnamed: 0,Movie,Release,Ratings,Duration,Director
0,The Godfather,1972,9.2,175 min,Francis Ford Coppola
1,The Godfather Part II,1974,9.0,202 min,Marlon Brando
2,One Flew Over the Cuckoo's Nest,1975,8.7,133 min,Al Pacino
3,Psycho,1960,8.5,109 min,James Caan
4,The Birds,1963,7.6,119 min,Diane Keaton
5,Forrest Gump,1994,8.8,142 min,Francis Ford Coppola
6,Sunrise: A Song of Two Humans,1927,8.1,94 min,Al Pacino
7,Citizen Kane,1941,8.3,119 min,Robert De Niro
8,Scarface,1932,7.7,93 min,Robert Duvall
9,Casablanca,1942,8.5,102 min,Diane Keaton


In [60]:
American_Top_100_movies.tail(10)

Unnamed: 0,Movie,Release,Ratings,Duration,Director
90,A Night at the Opera,1935,7.8,96 min,Vivien Leigh
91,Apocalypse Now,1979,8.5,147 min,Thomas Mitchell
92,Cat on a Hot Tin Roof,1958,7.9,108 min,Barbara O'Neil
93,Doctor Zhivago,1965,7.9,197 min,Elia Kazan
94,In a Lonely Place,1950,7.9,94 min,Vivien Leigh
95,Inherit the Wind,1960,8.1,128 min,Marlon Brando
96,Airplane!,1980,7.7,88 min,Kim Hunter
97,Yellow Sky,1948,7.4,98 min,Karl Malden
98,Shock Corridor,1963,7.3,101 min,Mike Nichols
99,Sherlock Jr.,1924,8.2,45 min,Elizabeth Taylor


# Q10. Fetch Top 100 Best Films of the Last 5 Years (2015-2019) and make a Data Frame

In [61]:
# send get requests to the webpage server to get the source code of the page
url = 'https://www.imdb.com/list/ls047884000/'
page10 = requests.get(url)

# check the page content
soup = BeautifulSoup(page10.content)

In [62]:
# Top Movies Name
name = soup.find_all("h3", class_ = "lister-item-header")

# get text from movie name web elements
# create empty list to store movies name
movies_name = []
for i in name:
    for p in i.find_all("a"):
        movies_name.append(p.text)

In [63]:
# Year of release
year = soup.find_all('span', class_ = "lister-item-year text-muted unbold")

# create empty list to store year of release
year_of_release = []
for t in year:
    p=t.text.replace('(','')
    year_of_release.append(p.replace(')',''))

In [64]:
# Movies ratings
IMDB_rating = soup.find_all('div', class_ = "ipl-rating-star small")

# create empty list to store movie ratings
ratings = []
for i in IMDB_rating:
    ratings.append(float(i.text))

In [65]:
# Movie duration
time = soup.find_all('span', class_ = "runtime")

# create empty list to store movie time
movie_time = []
for p in time:
    movie_time.append(p.text)

In [66]:
# Director
director = soup.find_all("p", class_ = "text-muted text-small")

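# create empty list to store the linked names
# Note: these paragraphs link to cast members as well as directors, so 'dire'
# holds several names per movie and dire[:100] is not one director per movie
dire = []
for i in director:
    for k in i.find_all("a"):
        dire.append(k.text)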

In [67]:
print(len(movies_name),len(year_of_release),len(ratings),len(movie_time),len(dire))

100 100 100 100 504


## Create a DataFrame

In [68]:
Top_100_Best_movies = pd.DataFrame({})
Top_100_Best_movies['Movie'] = movies_name
Top_100_Best_movies['Release'] = year_of_release
Top_100_Best_movies['Ratings'] = ratings
Top_100_Best_movies['Duration'] = movie_time
Top_100_Best_movies['Director'] = dire[:100]
Top_100_Best_movies

Unnamed: 0,Movie,Release,Ratings,Duration,Director
0,The Revenant,2015,8.0,156 min,Alejandro G. Iñárritu
1,Contratiempo,2016,8.0,106 min,Leonardo DiCaprio
2,Green Book,2018,8.2,130 min,Tom Hardy
3,Hacksaw Ridge,2016,8.1,139 min,Will Poulter
4,Molly's Game,2017,7.4,140 min,Domhnall Gleeson
...,...,...,...,...,...
95,Juzni vetar,2018,7.9,130 min,Albert Dupontel
96,Mission: Impossible - Fallout,2018,7.7,147 min,Nahuel Pérez Biscayart
97,Den of Thieves,2018,7.0,140 min,Albert Dupontel
98,The Guernsey Literary and Potato Peel Pie Society,2018,7.3,124 min,Laurent Lafitte


In [69]:
df = Top_100_Best_movies
df.head(20)

Unnamed: 0,Movie,Release,Ratings,Duration,Director
0,The Revenant,2015,8.0,156 min,Alejandro G. Iñárritu
1,Contratiempo,2016,8.0,106 min,Leonardo DiCaprio
2,Green Book,2018,8.2,130 min,Tom Hardy
3,Hacksaw Ridge,2016,8.1,139 min,Will Poulter
4,Molly's Game,2017,7.4,140 min,Domhnall Gleeson
5,Logan,2017,8.1,137 min,Oriol Paulo
6,The Accountant,2016,7.3,128 min,Mario Casas
7,Lion,2016,8.0,118 min,Ana Wagener
8,Mad Max: Fury Road,2015,8.1,120 min,Jose Coronado
9,Drishyam,2015,8.2,163 min,Bárbara Lennie


In [70]:
df.sample(20)

Unnamed: 0,Movie,Release,Ratings,Duration,Director
18,Daeho,2015,7.2,139 min,Luke Bracey
23,The Martian,2015,8.0,144 min,Kevin Costner
85,Mr. Church,2016,7.6,104 min,Sotiris Tsafoulias
16,Bohemian Rhapsody,2018,7.9,134 min,Andrew Garfield
0,The Revenant,2015,8.0,156 min,Alejandro G. Iñárritu
97,Den of Thieves,2018,7.0,140 min,Albert Dupontel
29,Veloce come il vento,2016,7.2,118 min,Boyd Holbrook
10,Gifted,2017,7.6,101 min,Peter Farrelly
24,A Dog's Purpose,2017,7.2,100 min,Michael Cera
15,Andhadhun,2018,8.2,139 min,Mel Gibson


In [71]:
df.tail(20)

Unnamed: 0,Movie,Release,Ratings,Duration,Director
80,Mowgli,2018,6.5,104 min,Bryan Singer
81,Suburra,2015,7.4,130 min,Rami Malek
82,Palmeras en la nieve,2015,7.3,163 min,Lucy Boynton
83,Warcraft,2016,6.7,123 min,Gwilym Lee
84,Deepwater Horizon,2016,7.1,107 min,Ben Hardy
85,Mr. Church,2016,7.6,104 min,Sotiris Tsafoulias
86,The Founder,2016,7.2,115 min,Pigmalion Dadakaridis
87,Alpha,II 2018,6.6,96 min,Dimitris Katalifos
88,L'ascension,I 2017,6.9,103 min,Manos Vakousis
89,Hidden Figures,2016,7.8,127 min,Ioanna Kolliopoulou
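Some 'Release' values above still carry a prefix ('II 2018', 'I 2017'), and 'Duration' is stored as text such as '96 min'. Both columns can be converted to numbers with a regular expression. The helper names below are hypothetical, and the sample values are taken from the table above.

```python
import re


def clean_year(text):
    """Return the four-digit year found in `text` as an int, or None."""
    match = re.search(r'\b(\d{4})\b', text)
    return int(match.group(1)) if match else None


def runtime_to_minutes(text):
    """Return the number in front of 'min' as an int, or None."""
    match = re.search(r'(\d+)\s*min', text)
    return int(match.group(1)) if match else None


years = [clean_year(y) for y in ['2016', 'II 2018', 'I 2017']]
minutes = [runtime_to_minutes(t) for t in ['96 min', '103 min', '127 min']]

print(years)    # [2016, 2018, 2017]
print(minutes)  # [96, 103, 127]
```

With numeric columns, the dataframe can be sorted by year or filtered by running time.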
