
## Task: Web Scraping

Your task is to scrape data for all Hindi movies, including name, year, director, rating, genre, top 5 cast, and poster image, and save this information in an Excel sheet. The choice of source is at your discretion; you can scrape from any website.

Note:

* In the image column, store the URL of each movie's poster image.
* Submit your Excel sheet along with a clearly documented Python script.





## Import Modules

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


## Request page source from URL

In [2]:
# URL of the IMDb page
url = 'https://www.imdb.com/list/ls004221468/'
response = requests.get(url)
response

<Response [403]>

In [3]:
# The bare request above returned 403, so retry with a browser-like User-Agent header
url = 'https://www.imdb.com/list/ls004221468/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
response = requests.get(url, headers=headers)
response

<Response [200]>

In [4]:
# Check if the request was successful
if response.status_code == 200:
    print("Request was successful.")
else:
    print(f"Request failed with status code: {response.status_code}")

Request was successful.
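
The status check above can be folded into a small helper that retries a failed request a few times before giving up. The sketch below is an assumption, not part of the original notebook: `fetch_with_retry`, its `retries` and `delay` parameters, and the 10-second timeout are hypothetical choices. It accepts any callable shaped like `requests.get`, so it can be tried without network access.

```python
import time


def fetch_with_retry(get, url, headers, retries=3, delay=1.0):
    """Call get(url, ...) until it returns HTTP 200 or retries run out.

    `get` is any callable shaped like requests.get (hypothetical helper,
    not part of the original notebook).
    """
    last_status = None
    for attempt in range(1, retries + 1):
        response = get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response
        last_status = response.status_code
        if attempt < retries:
            # Wait a little longer after each failed attempt
            time.sleep(delay * attempt)
    raise RuntimeError(
        f"{url} failed after {retries} attempts (last status: {last_status})"
    )
```

In this notebook it could be called as `response = fetch_with_retry(requests.get, url, headers)`.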


In [5]:
# Display the page content
# response.content

In [6]:
soup = BeautifulSoup(response.content, 'html.parser')
#print(soup.prettify())

In [7]:
# Function to get genre and top 5 cast from a movie's individual IMDb page
def get_movie_details(movie_link, headers):
    """Return (list of genres, comma-separated top 5 cast) for one movie page."""
    response = requests.get(movie_link, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Note: the 'sc-...' class names look auto-generated and may change over time
    cast_tags = soup.find_all('a', class_='sc-bfec09a1-1 gCQkeh')
    # Extract the top 5 cast names
    top_5_cast = [tag.text.strip() for tag in cast_tags[:5]]
    # Each genre is shown as a chip whose label sits in a nested span
    genre_tags = soup.find_all('a', class_='ipc-chip ipc-chip--on-baseAlt')
    genre = []
    for tag in genre_tags:
        span = tag.find('span', class_='ipc-chip__text')
        if span is not None:  # skip chips without a text label
            genre.append(span.text.strip())
    return genre, ', '.join(top_5_cast)
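
The class names used above (such as `sc-bfec09a1-1 gCQkeh`) look auto-generated, so they may stop matching when the page markup changes. One possible alternative, assuming the movie page embeds a schema.org `Movie` object in a `<script type="application/ld+json">` tag with `genre` and `actor` fields (an assumption to verify against the live page), is to read the details from that JSON instead. The helper name `details_from_json_ld` and the sample payload are hypothetical.

```python
import json


def details_from_json_ld(json_text, max_cast=5):
    """Return (genre list, comma-separated cast) from a JSON-LD string.

    Assumes a schema.org Movie object with optional 'genre' and 'actor'
    keys; these field names are an assumption, not confirmed by the notebook.
    """
    data = json.loads(json_text)
    genre = data.get('genre', [])
    if isinstance(genre, str):
        # A single genre may be given as a plain string
        genre = [genre]
    actors = data.get('actor', [])
    if isinstance(actors, dict):
        # A single actor may be given as one object instead of a list
        actors = [actors]
    cast = [a['name'] for a in actors[:max_cast]
            if isinstance(a, dict) and 'name' in a]
    return genre, ', '.join(cast)


# Made-up payload for illustration only
sample = ('{"@type": "Movie", "genre": ["Drama", "Action"], '
          '"actor": [{"@type": "Person", "name": "Actor One"}, '
          '{"@type": "Person", "name": "Actor Two"}]}')
print(details_from_json_ld(sample))
# (['Drama', 'Action'], 'Actor One, Actor Two')
```

Inside `get_movie_details`, the JSON text could be taken from `soup.find('script', type='application/ld+json').string`, provided that tag exists on the page.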

In [None]:
import time

# List to store one row per movie
movies = []

def text_or_na(tag):
    # Return the stripped text of a tag, or 'N/A' if the tag was not found
    return tag.text.strip() if tag is not None else 'N/A'

# Find all movie containers
movie_containers = soup.find_all('div', class_='ipc-metadata-list-summary-item__c')

# Loop through each movie container to extract data
for container in movie_containers:
    # The title text is prefixed with a rank number and period; keep only the name
    name = text_or_na(container.find('h3', class_='ipc-title__text'))
    movie_name = name.split(' ', 1)[-1]

    # Release year
    year = text_or_na(container.find('span', class_='sc-b189961a-8 kLaxqf dli-title-metadata-item')).strip('()')

    # Director names (a movie can have more than one)
    director_spans = container.find_all('span', class_='sc-74bf520e-5 ePoirh')
    director_links = [span.find('a', class_='ipc-link ipc-link--base dli-director-item') for span in director_spans]
    directors = [link.text for link in director_links if link is not None]

    # Rating
    rating = text_or_na(container.find('span', class_='ipc-rating-star--rating')).strip('()')

    # Poster image URL
    image_tag = container.find('img')
    image_url = image_tag['src'] if image_tag else 'N/A'

    # Build the absolute link to the movie's individual page
    movie_link = 'https://www.imdb.com' + container.find('a')['href']

    # Get genre and top 5 cast from the movie's individual page
    genre, top_5_cast = get_movie_details(movie_link, headers)

    # Join the director and genre lists into comma-separated strings
    director_str = ', '.join(directors)
    genre_str = ', '.join(genre)

    # Append movie data to the list
    movies.append([movie_name, year, director_str, genre_str, top_5_cast, rating, image_url])

    # Sleep to avoid overwhelming the server
    time.sleep(1)

# Create a DataFrame from the movie list
df = pd.DataFrame(movies, columns=['Movie', 'Year', 'Director', 'Genre', 'Top 5 Cast', 'Rating', 'Image_URL'])

# Save the DataFrame to an Excel file
df.to_excel('imdb_hindi_movies.xlsx', index=False)


print("Data has been successfully scraped and saved to imdb_hindi_movies.xlsx")
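
The loop above drops the rank prefix from each title by splitting on spaces. A slightly more explicit alternative is a regular expression that removes a leading number and period only when one is actually present. `strip_rank` is a hypothetical helper name, and the sample titles are made up for illustration.

```python
import re

# Matches a leading rank such as "12. " (digits, a period, then whitespace)
RANK_PREFIX = re.compile(r'^\s*\d+\.\s+')


def strip_rank(title):
    """Remove a leading 'N. ' rank from a list-page title, if present."""
    return RANK_PREFIX.sub('', title).strip()


print(strip_rank('1. Sample Movie'))  # Sample Movie
print(strip_rank('Sample Movie'))     # Sample Movie
print(strip_rank('12. 2.0'))          # 2.0
```

Requiring whitespace after the period keeps titles that are themselves numeric, such as `2.0`, intact.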


