<a href="https://colab.research.google.com/github/Rahul30032/IMDb_web_scraper/blob/main/IMDb_web_scrape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Internship Selection Task - MedBikri - web scraping 

* Task Description - Scraping multiple pages of a website to obtain formatted data in a CSV file.
* Website URL - [IMDb "Top 1000" (Sorted by Popularity Ascending)](https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv)
* Tech used - Python with bs4 (BeautifulSoup) for scraping
* Note: The following code was written on Google Colab and, with appropriate changes, can be run on a local system as well.

In [10]:
from google.colab import drive   # for Colab-Drive integration

In [11]:
drive.mount('/content/gdrive')  # mounting the gdrive 

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [16]:
# importing all necessary tools 

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

from time import sleep
from random import randint

In [17]:
# initialising the lists that will store the scraped data
titles = []                       # movie titles 
years = []                        # year of release 
time = []                         # total run-time 
imdb_ratings = []                 # IMDb ratings 
metascores = []                   # Metascore 
votes = []                        # votes 
us_gross = []                     # total US gross in USD 

In [18]:
# ensuring we obtain english translated movie titles only.
headers = {"Accept-Language": "en-US,en;q=0.5"}

In [19]:
# `start` is the URL parameter that moves us to the next page of results once the current page is scraped
pages = np.arange(1, 1001, 50) # each page lists 50 titles, so `start` takes the values 1, 51, ..., 951

for start in pages: 

  # keep the HTTP response in its own variable instead of overwriting the loop variable
  response = requests.get("https://www.imdb.com/search/title/?groups=top_1000&start=" + str(start) + "&ref_=adv_nxt", headers=headers)   

  soup = BeautifulSoup(response.text, 'html.parser')
  movie_div = soup.find_all('div', class_='lister-item mode-advanced')
  
  sleep(randint(2,10))    # throttling the crawl, as rapid back-to-back GET requests put pressure on the website's server 

  for container in movie_div:

        name = container.h3.a.text
        titles.append(name)
        
        year = container.h3.find('span', class_='lister-item-year').text
        years.append(year)

        runtime = container.p.find('span', class_='runtime').text if container.p.find('span', class_='runtime') else ''
        time.append(runtime)

        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)

        m_score = container.find('span', class_='metascore').text if container.find('span', class_='metascore') else ''
        metascores.append(m_score)

        nv = container.find_all('span', attrs={'name': 'nv'})
        
        vote = nv[0].text
        votes.append(vote)
        
        grosses = nv[1].text if len(nv) > 1 else ''
        us_gross.append(grosses)
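
The pagination described above can be checked without sending any requests. The following is a minimal sketch, assuming the same `groups`, `start` and `ref_` query parameters used in the request URL; `BASE_URL` and `build_page_urls` are hypothetical names introduced only for this illustration.

```python
from urllib.parse import urlencode

# Assumption: same endpoint and query parameters as the request in the loop above.
BASE_URL = "https://www.imdb.com/search/title/"


def build_page_urls(total=1000, per_page=50):
    """Return one search URL per results page, stepping the `start` offset."""
    urls = []
    for start in range(1, total + 1, per_page):
        query = urlencode({"groups": "top_1000", "start": start, "ref_": "adv_nxt"})
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = build_page_urls()
print(len(urls))   # number of result pages
print(urls[0])     # first page, start=1
print(urls[-1])    # last page, start=951
```

Printing the URLs first is a cheap way to confirm the offsets before starting a crawl that sleeps between requests.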



In [22]:
# storing all lists into a dataframe 
movies = pd.DataFrame({
'movie': titles,
'year': years,
'imdb': imdb_ratings,
'metascore': metascores,
'votes': votes,
'us_grossMillions': us_gross,
'timeMin': time
})

# data cleaning 

movies['votes'] = movies['votes'].str.replace(',', '').astype(int)  # removing thousands separators and converting to integers 

movies['year'] = movies['year'].str[-5:-1].astype(int)  # keeps the four digits before the closing parenthesis, e.g. '(1994)' -> 1994

movies['timeMin'] = movies['timeMin'].astype(str)
movies['timeMin'] = pd.to_numeric(movies['timeMin'].str.extract(r'(\d+)', expand=False), errors='coerce')  # missing runtimes become NaN instead of raising

movies['metascore'] = movies['metascore'].str.extract(r'(\d+)', expand=False)
movies['metascore'] = pd.to_numeric(movies['metascore'], errors='coerce')

movies['us_grossMillions'] = movies['us_grossMillions'].map(lambda x: x.lstrip('$').rstrip('M'))  # strips $ from start and M from the end of gross if provided
movies['us_grossMillions'] = pd.to_numeric(movies['us_grossMillions'], errors='coerce')


# # to see your dataframe
# print(movies)

# # to see the datatypes of your columns
# print(movies.dtypes)

# # to see where you're missing data and how much data is missing 
# print(movies.isnull().sum())

# to move all your scraped data to a CSV file
movies.to_csv('movies.csv')
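
The cleaning rules in this cell can also be tried on single strings, which makes edge cases such as a missing gross easier to see. This is a rough sketch using only the standard library rather than the pandas calls above; the helper names and the sample strings are hypothetical, and `parse_year` uses a regex instead of the notebook's string slicing.

```python
import re


def first_int(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def parse_year(text):
    """Return the first four-digit number in `text`, or None if there is none."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None


def parse_votes(text):
    """Remove thousands separators and convert to int."""
    return int(text.replace(",", ""))


def parse_gross(text):
    """Strip a leading '$' and trailing 'M'; return None when no number is left."""
    cleaned = text.strip().lstrip("$").rstrip("M")
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical sample strings in the shapes the scraper collects
print(parse_year("(I) (2019)"))
print(first_int("142 min"))
print(parse_votes("2,345,678"))
print(parse_gross("$28.34M"))
print(parse_gross(""))
```

Returning `None` for missing values plays the same role as `errors='coerce'` in the pandas version, where unparseable entries become `NaN`.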

In [21]:
# Colab-specific cell; skip it in local environments such as Jupyter Notebook
!cp movies.csv "/content/gdrive/My Drive/imdb_scraping"   # saves CSV to specified drive location 
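
On a local system the `!cp` step could be replaced with plain Python. A minimal sketch, assuming the CSV should land in a folder of your choosing; `save_copy` is a hypothetical helper, and the demo runs in a temporary directory so it does not touch real files.

```python
import shutil
import tempfile
from pathlib import Path


def save_copy(csv_path, dest_dir):
    """Copy `csv_path` into `dest_dir` (created if missing) and return the new path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(csv_path, dest))


# Demo with a placeholder file; in practice pass 'movies.csv' and your own folder.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "movies.csv"
    src.write_text("movie,year\n")
    copied = save_copy(src, Path(tmp) / "imdb_scraping")
    print(copied.name)
```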