Project Introduction
Title: Web Scraping IMDB for Movie and Actor Data
Overview
In today's data-driven world, timely and relevant information is crucial. Data on websites like IMDB is often updated, and it's important to have systems that can periodically retrieve new or updated data. In this project, we will build a robust web scraping system to gather data from the IMDB website on movies and their star actors. This system will allow us to fetch and update our dataset at different points in time, ensuring that we always have the most current information.

Objectives
Collect Movie Data: Scrape and collect detailed data on movies, including:

Name

Release Year

Duration

IMDB Rating

Popularity Metric

Up to the First 3 Genre Tags

Star Actors

Number of Award Nominations

Number of Award Wins

Collect Actor Data: Scrape and collect detailed data on actors starring in the selected movies, including:

Name

Number of Credits

Number of Award Nominations

Number of Award Wins

Methodology
Environment Setup:

Install required libraries: requests, BeautifulSoup, pandas, and optionally Selenium.

Web Scraping:

Use BeautifulSoup for parsing HTML content of IMDB pages.

Optionally use Selenium for dynamic content and navigating through pages.

Create functions to scrape movie data and actor data from IMDB.

Data Processing:

Clean and process the collected data to ensure completeness and accuracy.

Merge movie and actor data for comprehensive analysis.

Data Storage:

Save the collected and processed data to CSV files for easy access and future use.

Implement a scheduler to run the scraping script at different points in time for updated data.

Exploratory Data Analysis (EDA):

Perform EDA to explore patterns and trends in the collected data.

Generate visualizations to gain insights from the data.

In-Depth Analysis:

Answer in-depth questions based on the collected data, such as:

Trends in IMDB ratings over the years for top-rated movies.

Actors with the highest number of awards and nominations.

Correlation between the popularity metric and the number of award wins for movies.

Team Collaboration
To effectively split the work, each team member will have specific tasks:

Team Member 1: Data Collection (Movies)

Collect movie data using the OMDB API and web scraping.

Ensure data accuracy and completeness.

Team Member 2: Data Collection (Actors)

Collect actor data using additional sources (e.g., Wikipedia API, web scraping).

Ensure data accuracy and completeness.

Team Member 3: Data Processing & Cleaning

Load and clean the collected data.

Merge movie and actor data for analysis.

Team Member 4: Exploratory Data Analysis (EDA)

Perform EDA and generate visualizations.

Document insights and findings.

In [108]:
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service

In [150]:
api_key = 'f422b58e'

def fetch_movie_data(movie_titles, api_key):
    movie_data = []
    
    for title in movie_titles:
        search_url = f'http://www.omdbapi.com/?t={title}&apikey={api_key}'
        response = requests.get(search_url)
        
        # Check for a successful response
        if response.status_code == 200:
            try:
                movie_info = response.json()
                if movie_info['Response'] == 'True':
                    awards_text = movie_info.get('Awards', 'N/A')
                    wins, nominations = "0", "0"
                    if "wins" in awards_text and "nominations" in awards_text:
                        wins = awards_text.split("wins")[0].strip()
                        nominations = awards_text.split("&")[-1].split("nominations")[0].strip()

                    movie_data.append({
                        "Title": movie_info.get('Title', 'N/A'),
                        "Year": movie_info.get('Year', 'N/A'),
                        "Duration": movie_info.get('Runtime', 'N/A'),
                        "IMDB Rating": movie_info.get('imdbRating', 'N/A'),
                        "Genre": movie_info.get('Genre', 'N/A'),
                        "Actors": movie_info.get('Actors', 'N/A'),
                        "Nominations": nominations,
                        "Wins": wins,
                        "Popularity": movie_info.get('imdbVotes', 'N/A')
                    })
                else:
                    print(f"Could not find results for {title}")
            except ValueError:
                print(f"Error decoding JSON for {title}: {response.text}")
        else:
            print(f"Failed to fetch data for {title}: {response.status_code}")
    
    return pd.DataFrame(movie_data)

# Updated movie list with 100 items
movie_titles = [
    "Inception", "The Dark Knight", "Interstellar", "The Matrix", "Pulp Fiction",
    "The Lord of the Rings: The Fellowship of the Ring", "The Godfather", "The Shawshank Redemption",
    "Fight Club", "Forrest Gump", "The Empire Strikes Back", "The Dark Knight Rises", "Gladiator",
    "The Silence of the Lambs", "Saving Private Ryan", "Braveheart", "Schindler's List", "The Lion King",
    "Jurassic Park", "The Avengers", "Titanic", "The Departed", "The Wolf of Wall Street", "Django Unchained",
    "The Terminator", "Alien", "Blade Runner", "Goodfellas", "The Usual Suspects", "The Big Lebowski",
    "The Sixth Sense", "Se7en", "Avatar", "Avengers: Endgame", "Back to the Future", "Indiana Jones and the Last Crusade",
    "Harry Potter and the Philosopher's Stone", "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Toy Story", "Finding Nemo", "The Incredibles", "Inside Out", "The Exorcist", "Jaws", "Rocky",
    "A Clockwork Orange", "The Shining", "E.T. the Extra-Terrestrial", "The Breakfast Club", "Ferris Bueller's Day Off",
    "The Truman Show", "The Princess Bride", "Monty Python and the Holy Grail", "Groundhog Day", "The Grand Budapest Hotel",
    "La La Land", "Get Out", "Mad Max: Fury Road", "The Godfather: Part II", "Apocalypse Now", "Blade Runner 2049",
    "Casino Royale", "Doctor Strange", "Guardians of the Galaxy", "Logan", "The Revenant", "Spider-Man: Homecoming",
    "The Hateful Eight", "Once Upon a Time in Hollywood", "Parasite", "The Irishman", "1917",
    "Jojo Rabbit", "Joker", "Frozen", "The Shape of Water", "Three Billboards Outside Ebbing, Missouri",
    "Lady Bird", "The King's Speech", "The Social Network", "The Pursuit of Happyness", "A Beautiful Mind",
    "Shutter Island", "Inglourious Basterds", "Catch Me If You Can", "The Curious Case of Benjamin Button",
    "Black Swan", "Slumdog Millionaire", "The Green Mile", "The Notebook", "A Star Is Born",
    "The Help", "Gravity", "The Martian", "Gone Girl", "Whiplash", "The Revenant",
    "Shawshank Redemption", "Schindler's List", "Raging Bull", "Casablanca", "Citizen Kane",
    "Gone with the Wind", "Lawrence of Arabia", "The Godfather: Part II", "One Flew Over the Cuckoo's Nest", "Star Wars",
    "12 Angry Men", "Psycho", "Rear Window", "The Good, the Bad and the Ugly", "Sunset Boulevard",
    "Silence of the Lambs", "Raiders of the Lost Ark", "It's a Wonderful Life", "American Beauty", "Jaws", "The Exorcist",
    "The Silence of the Lambs", "Saving Private Ryan", "Braveheart", "The Lion King", "Titanic",
    "The Departed", "Gladiator", "Rocky", "E.T. the Extra-Terrestrial", "The Breakfast Club",
    "The Truman Show", "Fight Club", "Harry Potter and the Chamber of Secrets", "The Hunger Games", "The Last Samurai",
    "Pirates of the Caribbean: Dead Man's Chest", "The Prestige", "The Pursuit of Happyness", "Star Trek", "The Hobbit: An Unexpected Journey"
]

# Fetch movie data
df_movies = fetch_movie_data(movie_titles, api_key)

# Display movie data
print(df_movies.head())


             Title  Year Duration IMDB Rating                      Genre  \
0        Inception  2010  148 min         8.8  Action, Adventure, Sci-Fi   
1  The Dark Knight  2008  152 min         9.0       Action, Crime, Drama   
2     Interstellar  2014  169 min         8.7   Adventure, Drama, Sci-Fi   
3       The Matrix  1999  136 min         8.7             Action, Sci-Fi   
4     Pulp Fiction  1994  154 min         8.9               Crime, Drama   

                                              Actors Nominations  \
0  Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio...         220   
1        Christian Bale, Heath Ledger, Aaron Eckhart         165   
2  Matthew McConaughey, Anne Hathaway, Jessica Ch...         148   
3  Keanu Reeves, Laurence Fishburne, Carrie-Anne ...          52   
4      John Travolta, Uma Thurman, Samuel L. Jackson          72   

                Wins Popularity  
0  Won 4 Oscars. 159  2,645,456  
1  Won 2 Oscars. 164  2,974,670  
2    Won 1 Oscar. 44  2,259,078 

In [152]:
# Save DataFrame to CSV in the specified directory
df_movies.to_csv('C:/Users/robin/Documents/GitHub/movies_data.csv', index=False)

