![Example Image](https://static.amazon.jobs/teams/53/images/IMDb_Header_Page.jpg?1501027252)

## Problem Statement

This Jupyter notebook scrapes and processes movie data from IMDb. The key objectives and steps are:

1. **Web Scraping**: Extracting movie details such as title, rating, vote count, meta score, year, duration, age restriction, and introduction from the IMDb website.
2. **Data Cleaning**: Processing the extracted data to ensure consistency and usability, including converting duration from hours and minutes to total minutes and cleaning the 'title' column.
3. **Data Analysis**: Preparing the cleaned data for further analysis or visualization.
4. **Data Storage**: Saving the cleaned and processed data into a CSV file for future use.

<hr>

The notebook uses *Selenium* for web scraping, pandas for data manipulation, and regular expressions (`re`) for string operations.

The goal is to create a structured dataset from IMDb that can be used for analytical purposes such as understanding movie trends and ratings distributions.

### Install Libraries

In [None]:
#!pip install selenium
#!pip install tqdm

### Import Libraries

In [1]:
import pandas as pd
import re
from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

In [None]:
# URL of the IMDb Top 1000 movies page
url = 'https://www.imdb.com/search/title/?groups=top_1000'

![Example Image](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTrTu1lXB24XivLn6J-ujSHUFvedgtF38EnUg&s)

In [2]:
# Initialize the Chrome WebDriver
driver = webdriver.Chrome()

# Wait for 1 second before proceeding
time.sleep(1)

# Navigate to the specified URL
driver.get(url)

# Wait for 1 second to ensure the page has loaded
time.sleep(1)

# Print the title of the webpage
print(driver.title)

# Wait for 1 second before proceeding
time.sleep(1)

# Find the body element of the page using CSS selector
body = driver.find_element(By.CSS_SELECTOR, 'body')

# Simulate pressing the PAGE_DOWN key to scroll down the page
body.send_keys(Keys.PAGE_DOWN)

# Wait for 1 second before proceeding
time.sleep(1)
# Scroll down the page again
body.send_keys(Keys.PAGE_DOWN)

# Wait for 1 second before proceeding
time.sleep(1)
# Scroll down the page again
body.send_keys(Keys.PAGE_DOWN)

IMDb Top 1000 (Sorted by Popularity Ascending)
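The three repeated `PAGE_DOWN` presses in the cell above could be folded into a small loop. The sketch below is one possible way to do that: `scroll_down` and its parameters are hypothetical names that do not appear in the notebook, and the helper only assumes that the element exposes a `send_keys` method, as Selenium's `WebElement` does.

```python
import time


def scroll_down(element, key, times=3, pause=1.0):
    """Send `key` to `element` `times` times, pausing after each press.

    Hypothetical helper: `element` is assumed to expose `send_keys`,
    as Selenium's WebElement does.
    """
    for _ in range(times):
        element.send_keys(key)
        time.sleep(pause)  # give lazily loaded content time to render
```

In this notebook it would presumably be called as `scroll_down(body, Keys.PAGE_DOWN)`.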


In [3]:
# Find all elements with the class name "ipc-metadata-list-summary-item"
item_list = driver.find_elements(By.CLASS_NAME, "ipc-metadata-list-summary-item")

In [10]:
output = []

# Iterate over each movie element in the item_list
for movie in tqdm(item_list):
    # Extract the movie title using the class name 'ipc-title__text'
    title = movie.find_element(By.CLASS_NAME, 'ipc-title__text').text
    rating = movie.find_element(By.CLASS_NAME, 'ipc-rating-star--rating').text
    intro = movie.find_element(By.CLASS_NAME, 'ipc-html-content-inner-div').text
    vote_count = movie.find_element(By.CLASS_NAME, 'ipc-rating-star--voteCount').text
    
    # Try to extract the Metascore; some movies (for example, new releases) do not have one
    try:
        meta_score = movie.find_element(By.CSS_SELECTOR, ".metacritic-score-box").text
    except Exception:
        # No Metascore element was found for this movie
        meta_score = None
    
    # Extract the year, duration, and age restriction using the CSS selector '.dli-title-metadata'
    year, duration, age_restriction = movie.find_element(By.CSS_SELECTOR, ".dli-title-metadata").text.split("\n")
    
    # Append the extracted data to the output list as a list of values
    output.append([title, rating, vote_count, meta_score, year, duration, age_restriction, intro])

100%|██████████| 50/50 [00:06<00:00,  7.77it/s]
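The tuple unpacking on the metadata line assumes that every movie exposes exactly three fields, and it raises a `ValueError` otherwise. A more defensive variant, sketched below, classifies each field by its shape rather than its position. `parse_metadata` is a hypothetical helper, and its patterns are assumptions based on the sample rows in this notebook: a year starting with four digits, a duration such as `2h 8m`, and anything else treated as the age restriction.

```python
import re


def parse_metadata(text):
    """Split newline-separated metadata into (year, duration, age_restriction).

    Fields that are absent come back as None instead of raising ValueError.
    """
    year = duration = age_restriction = None
    for field in text.split("\n"):
        field = field.strip()
        if not field:
            continue
        if year is None and re.match(r"\d{4}", field):
            year = field  # assumed: the year starts with four digits
        elif duration is None and re.fullmatch(r"(?:\d+h)?\s*(?:\d+m)?", field):
            duration = field  # assumed: durations look like '2h 8m', '1h' or '45m'
        elif age_restriction is None:
            age_restriction = field  # whatever remains is treated as the rating label
    return year, duration, age_restriction
```

Inside the loop, the unpacking could then read `year, duration, age_restriction = parse_metadata(...)`, with missing values handled later during cleaning.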


In [11]:
# Create a pandas dataframe
movies = pd.DataFrame(output, columns=["title", "rating", "vote_count", "meta_score", "year", "duration", 
                                       "age_restriction", "intro"])

movies.head()

| | title | rating | vote_count | meta_score | year | duration | age_restriction | intro |
|---|---|---|---|---|---|---|---|---|
| 0 | 1. Deadpool & Wolverine | 8.2 | (94K) | 56.0 | 2024 | 2h 8m | 15 | Deadpool is offered a place in the Marvel Cine... |
| 1 | 2. Inside Out 2 | 7.8 | (88K) | 73.0 | 2024 | 1h 36m | U | A sequel that features Riley entering puberty ... |
| 2 | 3. Deadpool | 8.0 | (1.1M) | 65.0 | 2016 | 1h 48m | 15 | A wisecracking mercenary gets experimented on ... |
| 3 | 4. Deadpool 2 | 7.6 | (661K) | 66.0 | 2018 | 1h 59m | 15 | Foul-mouthed mutant mercenary Wade Wilson (a.k... |
| 4 | 5. Maharaja | 8.7 | (28K) | | 2024 | 2h 30m | 15 | A barber seeks vengeance after his home is bur... |


In [12]:
# Check the data types in pandas df
movies.dtypes

title              object
rating             object
vote_count         object
meta_score         object
year               object
duration           object
age_restriction    object
intro              object
dtype: object

In [13]:
# Cleaning 'year' column
# Use a raw string for the regex and expand=False so that a Series is returned
movies['year'] = movies['year'].str.extract(r'(\d+)', expand=False).astype(int)
movies.head(3)

| | title | rating | vote_count | meta_score | year | duration | age_restriction | intro |
|---|---|---|---|---|---|---|---|---|
| 0 | 1. Deadpool & Wolverine | 8.2 | (94K) | 56 | 2024 | 2h 8m | 15 | Deadpool is offered a place in the Marvel Cine... |
| 1 | 2. Inside Out 2 | 7.8 | (88K) | 73 | 2024 | 1h 36m | U | A sequel that features Riley entering puberty ... |
| 2 | 3. Deadpool | 8.0 | (1.1M) | 65 | 2016 | 1h 48m | 15 | A wisecracking mercenary gets experimented on ... |


In [14]:
def convert_to_minutes(time_str):
    """
    Convert a time string in the format 'Xh Ym' to total minutes.

    This function takes a string representing a duration in hours and/or minutes,
    such as '2h 30m', '1h', or '45m', and converts it to the total number of minutes.

    Parameters:
    time_str (str): The time string to convert, in the format 'Xh Ym' where
                    X is the number of hours and Y is the number of minutes.
                    Either part may be omitted.

    Returns:
    int: The total number of minutes represented by the input string.
         Returns None if the input string does not match the expected format.
    """
    # Both the hours and the minutes parts are optional, so '45m' is accepted too
    pattern = r'(?:(\d+)h)?\s*(?:(\d+)m)?'
    match = re.fullmatch(pattern, time_str.strip())
    # The pattern also matches an empty string, so require at least one part
    if match and (match.group(1) or match.group(2)):
        hours = int(match.group(1)) if match.group(1) else 0
        minutes = int(match.group(2)) if match.group(2) else 0
        return (hours * 60) + minutes
    return None

In [15]:
# Cleaning 'duration' column
movies['duration'] = movies['duration'].apply(convert_to_minutes)
movies.head()

| | title | rating | vote_count | meta_score | year | duration | age_restriction | intro |
|---|---|---|---|---|---|---|---|---|
| 0 | 1. Deadpool & Wolverine | 8.2 | (94K) | 56.0 | 2024 | 128 | 15 | Deadpool is offered a place in the Marvel Cine... |
| 1 | 2. Inside Out 2 | 7.8 | (88K) | 73.0 | 2024 | 96 | U | A sequel that features Riley entering puberty ... |
| 2 | 3. Deadpool | 8.0 | (1.1M) | 65.0 | 2016 | 108 | 15 | A wisecracking mercenary gets experimented on ... |
| 3 | 4. Deadpool 2 | 7.6 | (661K) | 66.0 | 2018 | 119 | 15 | Foul-mouthed mutant mercenary Wade Wilson (a.k... |
| 4 | 5. Maharaja | 8.7 | (28K) | | 2024 | 150 | 15 | A barber seeks vengeance after his home is bur... |


In [16]:
# Cleaning 'meta_score' column: keep only the digits (raw-string regex, Series result)
movies['meta_score'] = movies['meta_score'].str.extract(r'(\d+)', expand=False)
# Convert to float; values that cannot be parsed (such as dashes) become NaN
movies['meta_score'] = pd.to_numeric(movies['meta_score'], errors='coerce')
movies.head()

| | title | rating | vote_count | meta_score | year | duration | age_restriction | intro |
|---|---|---|---|---|---|---|---|---|
| 0 | 1. Deadpool & Wolverine | 8.2 | (94K) | 56.0 | 2024 | 128 | 15 | Deadpool is offered a place in the Marvel Cine... |
| 1 | 2. Inside Out 2 | 7.8 | (88K) | 73.0 | 2024 | 96 | U | A sequel that features Riley entering puberty ... |
| 2 | 3. Deadpool | 8.0 | (1.1M) | 65.0 | 2016 | 108 | 15 | A wisecracking mercenary gets experimented on ... |
| 3 | 4. Deadpool 2 | 7.6 | (661K) | 66.0 | 2018 | 119 | 15 | Foul-mouthed mutant mercenary Wade Wilson (a.k... |
| 4 | 5. Maharaja | 8.7 | (28K) | | 2024 | 150 | 15 | A barber seeks vengeance after his home is bur... |
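To see how the Metascore cleaning step treats missing values, the snippet below runs the same two operations on a small hand-made Series. The sample values are illustrative rather than scraped data, and it is an assumption that a missing Metascore arrives either as `None` or as non-numeric text.

```python
import pandas as pd

# Illustrative raw values: a numeric string, a missing value, and non-numeric text
raw = pd.Series(["56", None, "n/a"])

# Keep the first run of digits; rows without digits become missing
digits = raw.str.extract(r"(\d+)", expand=False)

# Convert to numbers; anything that cannot be parsed is coerced to NaN
meta_score = pd.to_numeric(digits, errors="coerce")

print(meta_score.tolist())
```

Only the first entry survives as a number, which is why the `meta_score` column ends up as a float column with `NaN` for movies such as Maharaja.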


In [17]:
def convert_to_integer(value_str):
    """
    Convert a string representing a number with a suffix (K or M) to an integer.

    This function takes a string in the format '(XK)' or '(YM)', where X is a number
    and Y is a number with optional decimal places, and converts it to the corresponding
    integer value. The suffix 'K' indicates thousands and 'M' indicates millions.

    Parameters:
    value_str (str): The string to convert, in the format '(XK)' or '(YM)'.

    Returns:
    int: The integer value represented by the input string, where 'K' denotes thousands
         and 'M' denotes millions. Returns None if the input string does not match the
         expected format.
    """
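    # The number is required; the K/M suffix is optional so a plain count also parses
    pattern = r'\((\d+(?:\.\d+)?)([KM])?\)'
    match = re.search(pattern, value_str)

    if match:
        number = float(match.group(1))
        suffix = match.group(2)
        # round() avoids off-by-one results caused by float truncation
        if suffix == 'K':
            return round(number * 1000)
        elif suffix == 'M':
            return round(number * 1000000)
        return round(number)
    else:
        return None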
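Multiplying a float and truncating with `int()` can, in some cases, land just below the intended value because of binary rounding. One way to sidestep this is exact decimal arithmetic, sketched below with `decimal.Decimal`. `parse_vote_count` is a hypothetical name, and treating a bare number without a suffix as a plain count is an assumption that the docstring above does not cover.

```python
import re
from decimal import Decimal

# Multipliers for the suffixes described in the docstring above
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_vote_count(value_str):
    """Convert strings such as '(94K)' or '(1.1M)' to an integer count."""
    match = re.search(r"\((\d+(?:\.\d+)?)([KM])?\)", value_str)
    if match is None:
        return None
    number = Decimal(match.group(1))  # exact decimal value, no binary rounding
    multiplier = _MULTIPLIERS.get(match.group(2), 1)  # no suffix: plain count
    return int(number * multiplier)


print(parse_vote_count("(94K)"), parse_vote_count("(1.1M)"))
# → 94000 1100000
```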

In [18]:
# Cleaning 'vote_count' column
movies['vote_count'] = movies['vote_count'].apply(convert_to_integer)
movies.head()

| | title | rating | vote_count | meta_score | year | duration | age_restriction | intro |
|---|---|---|---|---|---|---|---|---|
| 0 | 1. Deadpool & Wolverine | 8.2 | 94000 | 56.0 | 2024 | 128 | 15 | Deadpool is offered a place in the Marvel Cine... |
| 1 | 2. Inside Out 2 | 7.8 | 88000 | 73.0 | 2024 | 96 | U | A sequel that features Riley entering puberty ... |
| 2 | 3. Deadpool | 8.0 | 1100000 | 65.0 | 2016 | 108 | 15 | A wisecracking mercenary gets experimented on ... |
| 3 | 4. Deadpool 2 | 7.6 | 661000 | 66.0 | 2018 | 119 | 15 | Foul-mouthed mutant mercenary Wade Wilson (a.k... |
| 4 | 5. Maharaja | 8.7 | 28000 | | 2024 | 150 | 15 | A barber seeks vengeance after his home is bur... |


In [19]:
# Cleaning 'rating' column
movies['rating'] = movies['rating'].astype(float)
movies.head()

| | title | rating | vote_count | meta_score | year | duration | age_restriction | intro |
|---|---|---|---|---|---|---|---|---|
| 0 | 1. Deadpool & Wolverine | 8.2 | 94000 | 56.0 | 2024 | 128 | 15 | Deadpool is offered a place in the Marvel Cine... |
| 1 | 2. Inside Out 2 | 7.8 | 88000 | 73.0 | 2024 | 96 | U | A sequel that features Riley entering puberty ... |
| 2 | 3. Deadpool | 8.0 | 1100000 | 65.0 | 2016 | 108 | 15 | A wisecracking mercenary gets experimented on ... |
| 3 | 4. Deadpool 2 | 7.6 | 661000 | 66.0 | 2018 | 119 | 15 | Foul-mouthed mutant mercenary Wade Wilson (a.k... |
| 4 | 5. Maharaja | 8.7 | 28000 | | 2024 | 150 | 15 | A barber seeks vengeance after his home is bur... |


In [20]:
def remove_prefix(movie):
    """Remove the leading rank prefix (e.g. '1. ') from a movie title."""
    return re.sub(r'^\d+\.\s*', '', movie)
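The pattern `^\d+\.\s*` strips only a leading rank followed by a dot, so digits that belong to the title itself are left alone. The short check below illustrates this. The first title comes from the sample rows in this notebook, while the other two strings are made up for illustration.

```python
import re

RANK_PREFIX = r"^\d+\.\s*"  # same pattern as in remove_prefix

samples = ["2. Inside Out 2", "12. 12 Angry Men", "1917"]
cleaned = [re.sub(RANK_PREFIX, "", title) for title in samples]
print(cleaned)
# → ['Inside Out 2', '12 Angry Men', '1917']
```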

In [21]:
# Cleaning 'title' column
movies['title'] = movies['title'].apply(remove_prefix)
movies.head()

| | title | rating | vote_count | meta_score | year | duration | age_restriction | intro |
|---|---|---|---|---|---|---|---|---|
| 0 | Deadpool & Wolverine | 8.2 | 94000 | 56.0 | 2024 | 128 | 15 | Deadpool is offered a place in the Marvel Cine... |
| 1 | Inside Out 2 | 7.8 | 88000 | 73.0 | 2024 | 96 | U | A sequel that features Riley entering puberty ... |
| 2 | Deadpool | 8.0 | 1100000 | 65.0 | 2016 | 108 | 15 | A wisecracking mercenary gets experimented on ... |
| 3 | Deadpool 2 | 7.6 | 661000 | 66.0 | 2018 | 119 | 15 | Foul-mouthed mutant mercenary Wade Wilson (a.k... |
| 4 | Maharaja | 8.7 | 28000 | | 2024 | 150 | 15 | A barber seeks vengeance after his home is bur... |


In [22]:
movies.dtypes

title               object
rating             float64
vote_count           int64
meta_score         float64
year                 int64
duration             int64
age_restriction     object
intro               object
dtype: object

In [25]:
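# Save data to a file, creating the target directory first if it does not exist
import os
os.makedirs('data', exist_ok=True)
movies.to_csv('data/movies.csv', index=False)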

In [None]:
# Close all browser windows and end the WebDriver session
driver.quit()