<a href="https://colab.research.google.com/github/JuJu2181/2048.io/blob/master/imdb_episodes_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## A simple web scraper, built with Python and BS4, that scrapes the IMDb website for a series' episode details, for later use in data analysis and model building

Connecting Google Drive to Colab

In [64]:
# Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [72]:
%cd /content/gdrive/MyDrive/web_scraping_data/

/content/gdrive/MyDrive/web_scraping_data


Code starts here

In [2]:
#imports
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [34]:
# Testing BS4
# Retrieve website data
# url = 'https://www.imdb.com/title/tt0458290/episodes?season=1'
# response = requests.get(url)
# season_page = BeautifulSoup(response.content)
# # season_page
# episode_tiles = season_page.findAll('div',attrs={'class':'info'})
# # episode_tiles[0]

In [54]:
def getSeasonsCount(input_url):
    """
    Get the total number of seasons in the input series.
    Parameters:
    input_url: URL of the series' episodes page
    Returns:
    Total number of seasons (int)
    """
    response = requests.get(input_url)
    # Name the parser explicitly so bs4 does not have to guess one
    season_page = BeautifulSoup(response.content, 'html.parser')
    # The season dropdown holds one <option> per season
    seasonsSelect = season_page.find('div', attrs={'class': 'episode-list-select'}).find('select', attrs={'id': 'bySeason'})
    optionsVal = [int(option.text.strip()) for option in seasonsSelect.find_all('option')]
    return max(optionsVal)
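
The dropdown lookup in `getSeasonsCount` can be tried offline on a small HTML snippet. The sketch below is only an illustration: `SAMPLE_HTML` and `count_seasons` are hypothetical names, the markup is modelled on the selectors used above rather than copied from IMDb, and the non-numeric "Unknown" option is an assumed edge case added to show how such a label could be skipped instead of crashing `int()`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the selectors used in getSeasonsCount.
SAMPLE_HTML = """
<div class="episode-list-select">
  <select id="bySeason">
    <option value="1">1</option>
    <option value="2">2</option>
    <option value="3" selected>3</option>
    <option value="-1">Unknown</option>
  </select>
</div>
"""

def count_seasons(html):
    """Return the highest numeric season label in the dropdown, or 0 if none is found."""
    page = BeautifulSoup(html, "html.parser")
    select = page.find("select", attrs={"id": "bySeason"})
    if select is None:
        # The page has no season dropdown at all
        return 0
    labels = [option.get_text(strip=True) for option in select.find_all("option")]
    # Keep only labels made up of digits, so "Unknown" is ignored
    numbers = [int(label) for label in labels if label.isdigit()]
    return max(numbers, default=0)

season_count = count_seasons(SAMPLE_HTML)
print(season_count)
```

Returning 0 when nothing is found is a design choice for this sketch; raising an error would be just as reasonable if a missing dropdown should stop the scrape.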


In [57]:
# Scrape episode data from all seasons
def scrape_all_data(input_url):
    """
    Function that will scrape all the episodes data from a series
    Parameters:
    input_url: URL to scrape from
    Returns:
    dictionary containing scraped data
    """
    no_of_seasons = getSeasonsCount(input_url)
    season_no_list = []
    episode_no_list = []
    episode_name_list = []
    episode_airdate_list = []
    episode_score_list = []
    episode_votes_list = []
    for season in range(no_of_seasons):
        season_no = season+1
        print(f"Parsing Season {season_no}")
        # Retrieve website data
        url = f'{input_url}?season={season_no}'
        response = requests.get(url)
        season_page = BeautifulSoup(response.content, 'html.parser')
        episodes = season_page.find_all('div', attrs={'class': 'info'})
        print(f'There are {len(episodes)} episodes in season {season_no}')
        for episode_no, episode in enumerate(episodes):
            # episode_name 
            episode_name = episode.strong.a.text.strip()
            # episode_airdate
            episode_airdate = episode.find('div',attrs={'class':'airdate'}).text.strip().replace('.','')
            # episode_score (falls back to 0 when the rating markup is missing)
            try:
                episode_score = episode.find(
                    'div', attrs={'class': 'ipl-rating-widget'}
                ).find(
                    'div', attrs={'class': 'ipl-rating-star small'}
                ).find('span', attrs={'class': 'ipl-rating-star__rating'}).text
            except AttributeError:
                # find() returned None somewhere in the chain
                episode_score = 0

            # episode_votes (same fallback as the score)
            try:
                episode_votes = episode.find(
                    'div', attrs={'class': 'ipl-rating-widget'}
                ).find(
                    'div', attrs={'class': 'ipl-rating-star small'}
                ).find('span', attrs={'class': 'ipl-rating-star__total-votes'}).text.strip('()').replace(',', '')
            except AttributeError:
                episode_votes = 0
            # print(f'------------------- Episode {episode_no+1} -------------------')
            # print(f'Episode Name: {episode_name}')
            # print(f'Episode Date: {episode_airdate}')
            # print(f'Episode Score: {episode_score}')
            # print(f'Episode Votes: {episode_votes}')
            # print('--------------------------------------------------')
            
            # creating lists
            season_no_list.append(season_no)
            episode_no_list.append(episode_no+1)
            episode_name_list.append(episode_name)
            episode_airdate_list.append(episode_airdate)
            episode_score_list.append(episode_score)
            episode_votes_list.append(episode_votes)

    print('Parsing Done')
    # creating dictionary from list 
    data_dict = {
        'season_no': season_no_list,
        'episode no': episode_no_list,
        'episode_name': episode_name_list,
        'episode_airdate': episode_airdate_list,
        'episode_score': episode_score_list,
        'episode_votes': episode_votes_list
    }

    return data_dict
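
The scraper stores the air date, score and votes as the strings it finds on the page, with 0 as a fallback. For the later analysis step it may help to convert them to typed values. The following is a minimal sketch using only the standard library; the helper names `clean_airdate`, `clean_score` and `clean_votes` are hypothetical, and the date format is assumed from the sample output (`15 Jul 2016`).

```python
from datetime import datetime

def clean_airdate(text):
    """Parse an air date such as '15 Jul. 2016' into a date, or None if it does not match."""
    try:
        # %b matches abbreviated month names; this assumes an English locale
        return datetime.strptime(text.strip().replace(".", ""), "%d %b %Y").date()
    except ValueError:
        return None

def clean_score(text):
    """Convert a rating string such as '8.5' to float, or None if it is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def clean_votes(text):
    """Convert a vote count such as '(23,751)' to int, or None if it is not numeric."""
    digits = str(text).strip().strip("()").replace(",", "")
    return int(digits) if digits.isdigit() else None

print(clean_airdate("15 Jul. 2016"), clean_score("8.5"), clean_votes("(23,751)"))
```

Returning `None` for unparseable values, rather than 0, keeps missing ratings distinguishable from real zeros once the data is loaded into pandas.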


In [86]:
# Use the imdb package to look up the series URL from the name entered
import imdb
def getURlByName(series_name):
    """
    Get the IMDb URL of the series entered by the user.
    Parameters:
    series_name: Name of the series to scrape
    Returns:
    URL: URL of the series' IMDb page
    """
    ia = imdb.IMDb()
    results = ia.search_movie(series_name)
    if not results:
        raise ValueError(f'No IMDb results found for "{series_name}"')
    # Take the top search hit
    URL = ia.get_imdbURL(results[0])
    return URL
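
The notebook builds the per-season address by appending `episodes` to the title URL and then `?season=N`. This works when the title URL ends in a slash, as it does in the run shown in this notebook. A small helper can make the step independent of the trailing slash; `build_season_url` is a hypothetical name for this sketch, not part of the `imdb` package.

```python
from urllib.parse import urlencode

def build_season_url(title_url, season_no):
    """Return the episodes-page URL for one season of a title."""
    # Normalise the trailing slash so both URL forms give the same result
    base = title_url.rstrip("/") + "/episodes"
    return f"{base}?{urlencode({'season': season_no})}"

print(build_season_url("https://www.imdb.com/title/tt0458290/", 1))
```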

In [87]:
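# Function to save the dataset in a CSV file
def saveDataset(df,name):
    """
    Save the DataFrame as <name>_data.csv in the current directory,
    with the name lowercased and spaces replaced by underscores.
    """
    name = name.replace(' ','_').lower()
    df.to_csv(f'{name}_data.csv')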
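
`saveDataset` only replaces spaces, so a series name containing punctuation such as a colon or a slash ends up in the file name unchanged. The sketch below shows one possible way to normalise the name; `dataset_filename` is a hypothetical helper, and note that it also replaces non-ASCII letters, which may not be wanted for every title.

```python
import re

def dataset_filename(series_name):
    """Build a lowercase, underscore-separated CSV file name from a series name."""
    # Collapse every run of characters other than a-z and 0-9 into one underscore
    slug = re.sub(r"[^a-z0-9]+", "_", series_name.lower()).strip("_")
    return f"{slug}_data.csv"

print(dataset_filename("Stranger Things"))
print(dataset_filename("Star Wars: The Clone Wars"))
```

For plain names such as "Stranger Things" this gives the same file name as `saveDataset`.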

In [88]:
# Scrape the chosen series and build a pandas DataFrame
# The episodes URL has this format:
# URL = 'https://www.imdb.com/title/tt0458290/episodes'
series_name = input("Input name of series to generate data: ")
URL = getURlByName(series_name)+'episodes'
print(f'URL for {series_name}: {URL}')
print('----------------- SCRAPING BEGINS ----------------------------')
data = scrape_all_data(URL)
print('----------------- SCRAPING ENDS ------------------------------')
df = pd.DataFrame(data)
print('----------------- Saving Data --------------------------------')
saveDataset(df,series_name)
print('----------------- Dataset Saved ------------------------------')
df.head()

Input name of series to generate data: Stranger Things
URL for Stranger Things: https://www.imdb.com/title/tt4574334/episodes
----------------- SCRAPING BEGINS ----------------------------
Parsing Season 1
There are 8 episodes in season 1
Parsing Season 2
There are 9 episodes in season 2
Parsing Season 3
There are 8 episodes in season 3
Parsing Season 4
There are 9 episodes in season 4
Parsing Season 5
There are 1 episodes in season 5
Parsing Done
----------------- SCRAPING ENDS ------------------------------
----------------- Saving Data --------------------------------
----------------- Dataset Saved ------------------------------


Unnamed: 0,season_no,episode no,episode_name,episode_airdate,episode_score,episode_votes
0,1,1,Chapter One: The Vanishing of Will Byers,15 Jul 2016,8.5,23751
1,1,2,Chapter Two: The Weirdo on Maple Street,15 Jul 2016,8.4,21125
2,1,3,"Chapter Three: Holly, Jolly",15 Jul 2016,8.8,20983
3,1,4,Chapter Four: The Body,15 Jul 2016,8.9,20586
4,1,5,Chapter Five: The Flea and the Acrobat,15 Jul 2016,8.7,19503


In [89]:
%ls

Star_Wars_the_clone_wars_data.csv  stranger_things_data.csv
