## Web Scraping IMDb Top 250 Movies

### Table of Contents
- Purpose/Objective
- Tools Used
- Workflow
- Code Implementation
- Results
- Conclusion
- Showcases

### Purpose/Objective:
<div style='text-align: justify'>This project demonstrates web scraping in Python with the BeautifulSoup and requests libraries. The objective is to extract the IMDb Top 250 movies from the IMDb website, build a structured dataset, and save it as a CSV file. The project also serves as a showcase of web scraping skills for potential employers.
</div>


### Tools Used:
##### Python:
The programming language used for the entire project.
##### BeautifulSoup:
A Python library for parsing HTML and XML documents.
##### Requests:
A Python library for making HTTP requests.
##### Pandas:
A Python library for data manipulation.
##### CSV:
A file format used for storing tabular data.

### Workflow:
##### 1. Scraping IMDb Top 250 List 
The project starts by sending an HTTP GET request to the IMDb Top 250 list page with the requests library. BeautifulSoup then parses the returned HTML and extracts, for each movie, the combined rank and title, the release year, and the combined rating and vote count.
##### 2. Data Processing 
The extracted "rank_title" text is split into "rank" and "title", and "rating_votes" is split into "rating" and "votes", so that each value gets its own field.
##### 3. Creating DataFrame 
The processed data is structured into a pandas DataFrame named 'Top_movies'.
##### 4. Exporting to CSV
Finally, the 'Top_movies' DataFrame is exported to a CSV file named 'imdb_top_movies.csv' for further analysis or storage.
##### 5. Code Implementation
The Python code imports the necessary libraries, sends the HTTP request, parses the HTML content, processes the data, creates a DataFrame, and exports it to CSV. Each scraping cell wraps this work in a `try`/`except` block that prints any exception raised instead of stopping with a traceback.
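
The cells in this notebook catch a broad `Exception`. As one possible refinement, the sketch below separates HTTP error statuses from network-level failures and adds a timeout. The helper name `fetch_html`, the `timeout` value, and the `session` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, headers=None, timeout=10, session=None):
    """Return the page body as text, or None if the request fails.

    `fetch_html`, `timeout`, and `session` are illustrative names,
    not part of the original notebook.
    """
    # Use the supplied session-like object if given, else the requests module.
    http = session if session is not None else requests
    try:
        response = http.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raises HTTPError on 4xx/5xx responses
    except requests.exceptions.HTTPError as err:
        print(f"Server returned an error status: {err}")
        return None
    except requests.exceptions.RequestException as err:
        # Base class for connection failures, timeouts, invalid URLs, etc.
        print(f"Request failed: {err}")
        return None
    return response.text
```

Returning `None` on failure lets the caller decide whether to retry or stop, and the optional `session` argument makes the helper easy to exercise without touching the network.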

### Showcases

#### 1. Scraping Data

##### Import the Python libraries needed for scraping, building the DataFrame, and exporting to CSV.

In [8]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

##### Scrape the IMDb Top 250 movies list.

In [26]:
# URL of the IMDb Top 250 list
url = 'https://www.imdb.com/chart/top/'

# Headers to mimic a browser request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Class names of elements containing movie metadata and details
ul_class = "ipc-metadata-list ipc-metadata-list--dividers-between sc-a1e81754-0 eBRbsI compact-list-view ipc-metadata-list--base"
div_class = "sc-b189961a-0 hBZnfJ cli-children"

try:
    # Send an HTTP GET request to the URL with headers
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # Raise an exception for HTTP errors

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all movie <div> elements within the specified <ul> element
    movies = soup.find('ul', class_=ul_class).find_all('div', class_=div_class)

    # Iterate over each movie <div> element
    for movie in movies:
        # Find the <h3> element containing movie title
        rank_title = movie.find('h3', class_="ipc-title__text").text
        
        # Find the <div> element containing release year and then find the <span> within it
        release_year = movie.find('div', class_="sc-b189961a-7 feoqjK cli-title-metadata").span.text
        
        # Find the <span> element containing rating votes
        rating_votes = movie.find('span', class_="sc-b189961a-1 kcfvgk").span.text

        # Print movie details
        print("Rank and Title:", rank_title)
        print("Release_Year:", release_year)
        print("Rating and Votes:", rating_votes)
  
        # Stop after the first iteration     
        break
except Exception as e:
    # Catch any exceptions and print the error message
    print(e)


Rank and Title: 1. The Shawshank Redemption
Release_Year: 1994
Rating and Votes: 9.3 (2.9M)
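
The class strings used above mix readable names (`ipc-title__text`, `cli-title-metadata`) with fragments such as `sc-b189961a-0 hBZnfJ` that look auto-generated and may change when the site is rebuilt. The sketch below is one way to depend only on the readable names, using CSS selectors. `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for the real page, and `extract_movies` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the class names used in this
# notebook; the real IMDb page contains far more structure.
SAMPLE_HTML = """
<ul class="ipc-metadata-list compact-list-view">
  <li class="ipc-metadata-list-summary-item sc-10233bc-0 iherUv cli-parent">
    <div class="sc-b189961a-0 hBZnfJ cli-children">
      <h3 class="ipc-title__text">1. The Shawshank Redemption</h3>
      <div class="sc-b189961a-7 feoqjK cli-title-metadata">
        <span>1994</span>
      </div>
    </div>
  </li>
</ul>
"""


def extract_movies(html):
    """Return (rank_title, release_year) pairs using only readable classes."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # A CSS class selector matches an element that carries this class among
    # others, so the generated classes next to it no longer matter.
    for item in soup.select("li.ipc-metadata-list-summary-item"):
        heading = item.select_one("h3.ipc-title__text")
        year = item.select_one("div.cli-title-metadata span")
        if heading is None or year is None:
            continue  # skip entries whose layout differs from the expected one
        results.append((heading.get_text(strip=True), year.get_text(strip=True)))
    return results


print(extract_movies(SAMPLE_HTML))  # [('1. The Shawshank Redemption', '1994')]
```

Checking for `None` before reading the text also avoids the `AttributeError` that chained calls like `soup.find(...).find_all(...)` raise when an element is missing.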


##### Split rank_title into "rank" and "title", and rating_votes into "rating" and "votes".

In [28]:


url = 'https://www.imdb.com/chart/top/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

ul_class = "ipc-metadata-list ipc-metadata-list--dividers-between sc-a1e81754-0 eBRbsI compact-list-view ipc-metadata-list--base"
li_class = "ipc-metadata-list-summary-item sc-10233bc-0 iherUv cli-parent"
div_class = "sc-b189961a-0 hBZnfJ cli-children"

try:
    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')
    movies = soup.find('ul', class_=ul_class).find_all('div', class_= div_class)
    
    for movie in movies:
        # Split at the first period only, so titles that contain periods stay intact
        rank = movie.find('h3', class_="ipc-title__text").text.split('.', 1)[0]
        title = movie.find('h3', class_="ipc-title__text").text.split('.', 1)[-1]
        release_year = movie.find('div', class_="sc-b189961a-7 feoqjK cli-title-metadata").span.text
        rating = movie.find('span', class_="sc-b189961a-1 kcfvgk").span.text.split()[0]
        votes = movie.find('span', class_="sc-b189961a-1 kcfvgk").span.text.split()[-1].strip('()')
        
        print("Rank:", rank)
        print("Title:", title)
        print("Year:",release_year)
        print("Rating", rating) 
        print("Vote",votes)
        break
        
        
except Exception as e:
    print(e)


Rank: 1
Title:  The Shawshank Redemption
Year: 1994
Rating 9.3
Vote 2.9M
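
Splitting the heading text on every period would truncate a title that itself contains one, such as "Monsters, Inc.". The sketch below shows one way to split only at the first period and to separate the rating from the vote count. The helper names are illustrative, and the rank `200` in the second call is a made-up value used only to exercise the function.

```python
def split_rank_title(rank_title):
    """Split '1. The Shawshank Redemption' into (1, 'The Shawshank Redemption')."""
    # partition() cuts at the first period only, so periods inside the
    # title are left untouched.
    rank, _, title = rank_title.partition(".")
    return int(rank), title.strip()


def split_rating_votes(rating_votes):
    """Split '9.3 (2.9M)' into ('9.3', '2.9M')."""
    # split() with no separator handles any run of whitespace.
    rating, votes = rating_votes.split(maxsplit=1)
    return rating, votes.strip("()")


print(split_rank_title("1. The Shawshank Redemption"))  # (1, 'The Shawshank Redemption')
print(split_rank_title("200. Monsters, Inc."))           # (200, 'Monsters, Inc.')
print(split_rating_votes("9.3 (2.9M)"))                  # ('9.3', '2.9M')
```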


##### Repeat the extraction with a comment on each step.

In [29]:
url = 'https://www.imdb.com/chart/top/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

ul_class = "ipc-metadata-list ipc-metadata-list--dividers-between sc-a1e81754-0 eBRbsI compact-list-view ipc-metadata-list--base"
li_class = "ipc-metadata-list-summary-item sc-10233bc-0 iherUv cli-parent"
div_class = "sc-b189961a-0 hBZnfJ cli-children" 

try:
    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')
    movies = soup.find('ul', class_=ul_class).find_all('div', class_=div_class)
    
    for movie in movies:
        # Find the <h3> element containing rank and title; split at the first period and keep the rank
        rank = movie.find('h3', class_="ipc-title__text").text.split('.', 1)[0]
        
        # Same <h3> element; keep everything after the first period as the title
        title = movie.find('h3', class_="ipc-title__text").text.split('.', 1)[-1]
        
        # Find the <div> element containing release year and extract the text
        release_year = movie.find('div', class_="sc-b189961a-7 feoqjK cli-title-metadata").span.text
        
        # Find the <span> element containing the rating and votes, split it and get the rating value
        rating = movie.find('span', class_="sc-b189961a-1 kcfvgk").span.text.split()[0]
        
        # Find the <span> element containing the rating and votes, split it and get the votes count
        vote = movie.find('span', class_="sc-b189961a-1 kcfvgk").span.text.split()[-1].strip('()')
        
        # Print the rank, title, release year, rating, and votes
        print("Rank:", rank)
        print("Title:", title)
        print("Year:",release_year)
        print("Rating", rating) 
        print("Votes", vote)
        
        # Stop after processing the first movie
        break
        
except Exception as e:
    # Print any exception that occurs during the process
    print(e)


Rank: 1
Title:  The Shawshank Redemption
Year: 1994
Rating 9.3
Votes 2.9M


##### Display all the top-rated movies.

In [5]:
url = 'https://www.imdb.com/chart/top/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

ul_class = "ipc-metadata-list ipc-metadata-list--dividers-between sc-a1e81754-0 eBRbsI compact-list-view ipc-metadata-list--base"
li_class = "ipc-metadata-list-summary-item sc-10233bc-0 iherUv cli-parent"
div_class = "sc-b189961a-0 hBZnfJ cli-children"

try:
    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')
    movies = soup.find('ul', class_=ul_class).find_all('div', class_= div_class)
    
    
    for movie in movies:
        # Split at the first period only, so titles that contain periods stay intact
        rank = movie.find('h3', class_="ipc-title__text").text.split('.', 1)[0]
        title = movie.find('h3', class_="ipc-title__text").text.split('.', 1)[-1]
        release_year = movie.find('div', class_="sc-b189961a-7 feoqjK cli-title-metadata").span.text
        rating = movie.find('span', class_="sc-b189961a-1 kcfvgk").span.text.split()[0]
        votes = movie.find('span', class_="sc-b189961a-1 kcfvgk").span.text.split()[-1].strip('()')
        # Display the elements in a single line
        print(rank, title, release_year, rating, votes)
        
            
except Exception as e:
    print(e)


1  The Shawshank Redemption 1994 9.3 2.9M
2  The Godfather 1972 9.2 2M
3  The Dark Knight 2008 9.0 2.9M
4  The Godfather Part II 1974 9.0 1.4M
5  12 Angry Men 1957 9.0 866K
6  Schindler's List 1993 9.0 1.5M
7  The Lord of the Rings: The Return of the King 2003 9.0 2M
8  Pulp Fiction 1994 8.9 2.2M
9  The Lord of the Rings: The Fellowship of the Ring 2001 8.9 2M
10  The Good, the Bad and the Ugly 1966 8.8 812K
11  Forrest Gump 1994 8.8 2.3M
12  The Lord of the Rings: The Two Towers 2002 8.8 1.8M
13  Fight Club 1999 8.8 2.3M
14  Inception 2010 8.8 2.6M
15  Star Wars: Episode V - The Empire Strikes Back 1980 8.7 1.4M
16  The Matrix 1999 8.7 2.1M
17  Goodfellas 1990 8.7 1.3M
18  One Flew Over the Cuckoo's Nest 1975 8.7 1.1M
19  Se7en 1995 8.6 1.8M
20  Interstellar 2014 8.7 2.1M
21  It's a Wonderful Life 1946 8.6 499K
22  Dune: Part Two 2024 8.7 384K
23  Seven Samurai 1954 8.6 367K
24  The Silence of the Lambs 1991 8.6 1.5M
25  Saving Private Ryan 1998 8.6 1.5M
26  City of God 2002 8.6 801K


#### 2. Creating a DataFrame

##### Collect the scraped movie details into a DataFrame named 'Top_movies'.

In [30]:

url = 'https://www.imdb.com/chart/top/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

ul_class = "ipc-metadata-list ipc-metadata-list--dividers-between sc-a1e81754-0 eBRbsI compact-list-view ipc-metadata-list--base"

div_class = "sc-b189961a-0 hBZnfJ cli-children" 

try:
    
    response = requests.get(url, headers=headers)
    response.raise_for_status()  


    soup = BeautifulSoup(response.text, 'html.parser')
    movies = soup.find('ul', class_=ul_class).find_all('div', class_=div_class)
    
    # Initialize lists to store movie details
    ranks = []
    titles = []
    years = []
    ratings = []
    votes = []

    
    for movie in movies:
        # Split at the first period only, so titles that contain periods stay intact
        rank = movie.find('h3', class_="ipc-title__text").text.split('.', 1)[0]
        title = movie.find('h3', class_="ipc-title__text").text.split('.', 1)[-1].strip()
        release_year = movie.find('div', class_="sc-b189961a-7 feoqjK cli-title-metadata").span.text.strip()
        rating = movie.find('span', class_="sc-b189961a-1 kcfvgk").span.text.split()[0]
        vote_count = movie.find('span', class_="sc-b189961a-1 kcfvgk").span.text.split()[-1].strip('()')

        # Append movie details to respective lists
        ranks.append(rank)
        titles.append(title)
        years.append(release_year)
        ratings.append(rating)
        votes.append(vote_count)
    
    # Create a DataFrame from the collected movie details
    Top_movies = pd.DataFrame({'Rank': ranks, 'Title': titles, 'Release_Year': years, 'Rating': ratings, 'Vote_Count': votes})
    print(Top_movies)
        
except Exception as e:
    # Print any exceptions that occur during the process
    print(e)


    Rank                     Title Release_Year Rating Vote_Count
0      1  The Shawshank Redemption         1994    9.3       2.9M
1      2             The Godfather         1972    9.2         2M
2      3           The Dark Knight         2008    9.0       2.9M
3      4     The Godfather Part II         1974    9.0       1.4M
4      5              12 Angry Men         1957    9.0       866K
..   ...                       ...          ...    ...        ...
245  246                  The Help         2011    8.1       493K
246  247     It Happened One Night         1934    8.1       112K
247  248                   Aladdin         1992    8.0       467K
248  249        Dances with Wolves         1990    8.0       291K
249  250              Paris, Texas         1984    8.1       119K

[250 rows x 5 columns]
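
Every column in this DataFrame holds strings, including abbreviated vote counts such as `2.9M`. If numeric analysis is the goal, the columns could be converted along the lines of the sketch below. The helper `parse_vote_count` and the three-row `sample` frame are illustrative assumptions, and the resulting vote counts are only as precise as the rounded figures IMDb displays.

```python
from decimal import Decimal

import pandas as pd

# Multipliers for the abbreviated vote counts seen in the output above.
_SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_vote_count(text):
    """Convert '2.9M' or '866K' to an integer; plain digits pass through."""
    text = text.strip()
    multiplier = _SUFFIXES.get(text[-1].upper())
    if multiplier is None:
        return int(text)
    # Decimal avoids binary floating-point rounding in the multiplication.
    return int(Decimal(text[:-1]) * multiplier)


# Three rows copied from the DataFrame output above, all stored as strings.
sample = pd.DataFrame({
    "Rank": ["1", "2", "5"],
    "Title": ["The Shawshank Redemption", "The Godfather", "12 Angry Men"],
    "Release_Year": ["1994", "1972", "1957"],
    "Rating": ["9.3", "9.2", "9.0"],
    "Vote_Count": ["2.9M", "2M", "866K"],
})

# Cast the plain numeric columns, then expand the abbreviated vote counts.
typed = sample.astype({"Rank": int, "Release_Year": int, "Rating": float})
typed["Vote_Count"] = typed["Vote_Count"].map(parse_vote_count)
print(typed.dtypes)
```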


#### 3. Exporting to a CSV File

##### Save the 'Top_movies' DataFrame as the CSV file 'imdb_top_movies.csv'.

In [9]:

# Save the DataFrame 'Top_movies' to a CSV file named 'imdb_top_movies.csv'
Top_movies.to_csv('imdb_top_movies.csv', index=False)

# Print a message indicating that the CSV file has been saved successfully
print("CSV file saved successfully.")



CSV file saved successfully.
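
Some titles in the list, such as "Paris, Texas", contain a comma, which is also the CSV field separator. `to_csv` uses minimal quoting by default, so such fields are wrapped in double quotes and survive a round trip. The sketch below illustrates the same quoting rule with Python's standard `csv` module instead of pandas, writing to an in-memory buffer. The two sample rows are taken from the DataFrame output shown earlier.

```python
import csv
import io

rows = [
    {"Rank": 1, "Title": "The Shawshank Redemption", "Release_Year": 1994},
    {"Rank": 250, "Title": "Paris, Texas", "Release_Year": 1984},
]

# Write to an in-memory buffer instead of a file, so nothing is left on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Rank", "Title", "Release_Year"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)  # the second data row contains "Paris, Texas" in double quotes

# Reading the text back recovers the title with its comma intact.
restored = list(csv.DictReader(io.StringIO(csv_text)))
print(restored[1]["Title"])  # Paris, Texas
```

Note that values read back this way are strings; `pd.read_csv` infers numeric types for columns such as `Rank` on its own.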


##### Load the CSV file 'imdb_top_movies.csv' with pandas.

In [31]:

# Read the CSV file 'imdb_top_movies.csv' into a DataFrame named 'Top_movies'
Top_movies = pd.read_csv('imdb_top_movies.csv')

# Display the DataFrame 'Top_movies'
Top_movies


| | Rank | Title | Release_Year | Rating | Vote_Count |
|---|---|---|---|---|---|
| 0 | 1 | The Shawshank Redemption | 1994 | 9.3 | 2.9M |
| 1 | 2 | The Godfather | 1972 | 9.2 | 2M |
| 2 | 3 | The Dark Knight | 2008 | 9.0 | 2.9M |
| 3 | 4 | The Godfather Part II | 1974 | 9.0 | 1.4M |
| 4 | 5 | 12 Angry Men | 1957 | 9.0 | 866K |
| ... | ... | ... | ... | ... | ... |
| 245 | 246 | The Help | 2011 | 8.1 | 493K |
| 246 | 247 | It Happened One Night | 1934 | 8.1 | 112K |
| 247 | 248 | Aladdin | 1992 | 8.0 | 467K |
| 248 | 249 | Dances with Wolves | 1990 | 8.0 | 291K |
