## Web Scraping
Web scraping is the extraction of data from a website. In this practical we will use the Python library **Beautiful Soup**. The scraper loads the HTML of the page the user wants to collect data from, and then either extracts all the data on the page or only the specific data the user selects. Selecting data is done by inspecting the page's HTML and picking the specific element or tag that contains the desired information.

### Data to Scrape
In this practical we will scrape imdb.com to fetch information about movies of different genres, using the Python libraries BeautifulSoup and requests. IMDB (Internet Movie Database), which is owned by Amazon, is one of the best platforms for finding information about films, television shows, web series, etc.

The data that we want to extract from it are:
* Movie title
* Star rating
* Metascore
* Description


To extract all of this data, our scraper will read the search results page for each genre, where every movie appears as one list item. Now let's start scraping.

## Load Libraries
Before we begin, we need to import the libraries that will be used for this practical.

In [9]:
pip install pandas

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Getting URLs of different pages
The first thing we need to do is build the URL for each movie genre we want to scrape. The available genres include Adventure, Animation, Drama, Comedy, Horror, etc.


In [3]:
# URLs for different genres
genres = ["Adventure","Action","Biography"]

url_dict = {}
for genre in genres:
    formatted_url = f"https://www.imdb.com/search/title/?genres={genre}"
    url_dict[genre] = formatted_url

print(url_dict)

{'Adventure': 'https://www.imdb.com/search/title/?genres=Adventure', 'Action': 'https://www.imdb.com/search/title/?genres=Action', 'Biography': 'https://www.imdb.com/search/title/?genres=Biography'}
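
The f-string above works because these genre names contain no special characters. As a minimal sketch, assuming a genre value might one day contain spaces or symbols, the query string can be built with the standard library's `urlencode`, which escapes the value. The helper name `build_genre_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"


def build_genre_url(genre):
    """Return the search URL for one genre, escaping the genre value."""
    return f"{BASE_URL}?{urlencode({'genres': genre})}"


genres = ["Adventure", "Action", "Biography"]
url_dict = {genre: build_genre_url(genre) for genre in genres}

print(url_dict["Adventure"])
# https://www.imdb.com/search/title/?genres=Adventure
```

For plain names such as these, the result is identical to the f-string version.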


## Parsing Movie Information
Now let's parse the movie information from IMDB. We will work with one genre first.

In [11]:
url = "https://www.imdb.com/search/title/?genres=Adventure" 
 
headers = { 
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) ' 
                  'AppleWebKit/537.36 (KHTML, like Gecko) ' 
                  'Chrome/50.0.2661.102 Safari/537.36' 
} 
 
# Sending a request to the specified URL 
result = requests.get(url, headers=headers) 
print(result.status_code)

200
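
A status code of 200 means the request succeeded. Printing the code relies on the reader noticing a failure, so one possible alternative is to let `requests` raise an exception for error responses and to set a timeout so the call cannot hang indefinitely. The sketch below is an assumption about how this could be wrapped: the function name `fetch_html`, the shortened `User-Agent` value, and the 10-second timeout are all illustrative choices.

```python
import requests

# Shortened placeholder; the full User-Agent string from above can be used instead
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` would then either return the page HTML or stop with a clear error instead of passing an error page on to the parser.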


In [12]:
# Converting the response to a Beautiful Soup object, naming the parser explicitly 
content = BeautifulSoup(result.content, 'html.parser')
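
Once the page is a Beautiful Soup object, tags can be looked up by name and CSS class. The sketch below illustrates `find_all` and `find` on a small made-up HTML snippet so that it runs without a network connection. The tag and class names in it are invented for illustration and are not real IMDB markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates a list of results; not real IMDB markup
sample_html = """
<ul>
  <li class="movie"><h3 class="title">First Film</h3></li>
  <li class="movie"><h3 class="title">Second Film</h3></li>
  <li class="advert"><h3 class="title">Sponsored</h3></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; class_ filters on the CSS class
items = soup.find_all("li", class_="movie")

# find returns only the first match inside each item
titles = [item.find("h3", class_="title").get_text() for item in items]
print(titles)  # ['First Film', 'Second Film']
```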

## Creating a scraping function
Now let's create a function that does the same as above but can be reused for different URLs.

In [13]:
def get_movies(url):
    """Scrape one IMDB search results page and return its movies as a DataFrame."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/50.0.2661.102 Safari/537.36'
    }
    result = requests.get(url, headers=headers)

    content = BeautifulSoup(result.content, 'html.parser')

    # Each movie on the results page is one <li> element
    movie_list = content.find_all('li', class_='ipc-metadata-list-summary-item')

    m_list = []

    # Iterating through the list of movies
    for movie in movie_list:
        title = movie.find('h3', class_='ipc-title__text').get_text()

        # Unreleased titles have no rating yet, so fall back to an empty string
        star = movie.find('span', class_='ipc-rating-star--rating')
        star = star.get_text() if star is not None else ""

        # Match only the descriptive class name; generated classes such as
        # "sc-ae9e80c5-0 gXcoKx" tend to change when the site is rebuilt
        metascore = movie.find('span', class_='metacritic-score-box')
        metascore = metascore.get_text() if metascore is not None else ""

        # Guard against a missing description as well, to avoid an AttributeError
        description = movie.find('div', class_='ipc-html-content-inner-div')
        description = description.get_text() if description is not None else ""

        data = {
            "title": title,
            "star": star,
            "metaScore": metascore,
            "description": description
        }

        m_list.append(data)

    return pd.DataFrame(m_list)
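
The function repeats the same "find the tag, then check for `None`" pattern for several fields. As a sketch of one way to factor that out, the hypothetical helper `text_or_empty` below returns the tag's text or an empty string. It is demonstrated on a made-up snippet, not on real IMDB markup.

```python
from bs4 import BeautifulSoup


def text_or_empty(parent, name, class_name):
    """Return the text of the first matching child tag, or "" if there is none."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else ""


# Made-up item that has a title but no rating
sample = BeautifulSoup(
    '<li><h3 class="title">Unreleased Film</h3></li>', "html.parser"
)
item = sample.find("li")

print(text_or_empty(item, "h3", "title"))           # Unreleased Film
print(repr(text_or_empty(item, "span", "rating")))  # ''
```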

In [15]:
url = "https://www.imdb.com/search/title/?genres=Adventure" 
 
# Calling the function 
get_movies(url) 

|   | title | star | metaScore | description |
|---|-------|------|-----------|-------------|
| 0 | 1. The Last of Us | 8.7 | | After a global pandemic destroys civilization,... |
| 1 | 2. Thunderbolts* | 7.6 | 68.0 | After finding themselves ensnared in a death t... |
| 2 | 3. Andor | 8.4 | | In an era filled with danger, deception, and i... |
| 3 | 4. El Eternauta | 7.6 | | Follows Juan Salvo along with a group of survi... |
| 4 | 5. Game of Thrones | 9.2 | | Nine noble families fight for control over the... |
| 5 | 6. A Minecraft Movie | 5.8 | 45.0 | Four misfits are suddenly pulled through a mys... |
| 6 | 7. The Old Guard 2 | | | Andy leads immortal warriors against a powerfu... |
| 7 | 8. Mission: Impossible - The Final Reckoning | | | Our lives are the sum of our choices. Tom Crui... |
| 8 | 9. Doctor Who | 6.2 | | The Time Lord known as the Doctor travels thro... |
| 9 | 10. Snow White | 1.6 | 50.0 | A princess joins forces with seven dwarfs and ... |
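
Each scraped title carries its position in the list as a prefix, for example `1. The Last of Us`. If the rank and the title are wanted separately, a regular expression can split them. This is an optional sketch, and the helper name `split_rank` is hypothetical.

```python
import re

# One or more digits, a dot, whitespace, then the rest of the title
RANKED_TITLE = re.compile(r"^(\d+)\.\s+(.*)$")


def split_rank(raw_title):
    """Split '1. The Last of Us' into (1, 'The Last of Us')."""
    match = RANKED_TITLE.match(raw_title)
    if match is None:
        # No rank prefix found, so return the title unchanged
        return None, raw_title
    return int(match.group(1)), match.group(2)


print(split_rank("1. The Last of Us"))  # (1, 'The Last of Us')
print(split_rank("8. Mission: Impossible - The Final Reckoning"))
# (8, 'Mission: Impossible - The Final Reckoning')
```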


## Scraping movies of different genres
The **get_movies()** function we wrote above parses movie details from the IMDB page for a given genre URL and returns them as a DataFrame. By calling it once per genre, we can combine the results for all genres and save them to a single CSV file. Let's see how this can be done.

In [16]:
df_data = pd.DataFrame() 
 
for genre, url in url_dict.items(): 
    df_data = pd.concat([df_data, get_movies(url)]) 
     
df_data.to_csv('movies.csv') 
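
The combined DataFrame does not record which genre each row came from, and its index restarts at 0 for every genre. The sketch below shows one possible variation that adds a `genre` column, renumbers the rows, and converts the rating to a number. The function name `combine_genres` and the stand-in scraper `fake_scrape` are hypothetical, used here so the example runs offline.

```python
import pandas as pd


def combine_genres(url_dict, scrape):
    """Scrape every genre URL and stack the results into one DataFrame."""
    frames = []
    for genre, url in url_dict.items():
        frame = scrape(url)
        frame["genre"] = genre  # remember where each row came from
        frames.append(frame)
    # ignore_index renumbers the rows 0..n-1 across all genres
    return pd.concat(frames, ignore_index=True)


# Stand-in for get_movies() so the sketch needs no network access
def fake_scrape(url):
    return pd.DataFrame([{"title": f"Film from {url}", "star": "7.5"}])


combined = combine_genres({"Adventure": "url-a", "Action": "url-b"}, fake_scrape)

# Scraped values are strings; empty strings become NaN with errors="coerce"
combined["star"] = pd.to_numeric(combined["star"], errors="coerce")
print(combined)
```

With the real scraper this could be called as `combine_genres(url_dict, get_movies)`. Saving with `to_csv('movies.csv', index=False)` would leave the row index out of the file, which avoids an extra `Unnamed: 0` column when the CSV is read back.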

In [19]:
# print content
print(df_data.head(10))

                                          title star metaScore  \
0                             1. The Last of Us  8.7             
1                              2. Thunderbolts*  7.6        68   
2                                      3. Andor  8.4             
3                               4. El Eternauta  7.6             
4                            5. Game of Thrones  9.2             
5                          6. A Minecraft Movie  5.8        45   
6                            7. The Old Guard 2                  
7  8. Mission: Impossible - The Final Reckoning                  
8                                 9. Doctor Who  6.2             
9                                10. Snow White  1.6        50   

                                         description  
0  After a global pandemic destroys civilization,...  
1  After finding themselves ensnared in a death t...  
2  In an era filled with danger, deception, and i...  
3  Follows Juan Salvo along with a group of survi... 