# IMDb Scraping
For this project we use `urllib.request` and BeautifulSoup to scrape the Internet Movie Database (IMDb), starting from an advanced search of popular TV titles (approx. 3-4K titles).
- Step 1. We first scrape the title overview section of the search results to collect surface information.

- Step 2. We use the IMDb ID to scrape each title's information page and pull additional details such as rating, seasons, distributors, and the full genre list.

- Step 3. (In a separate notebook) We investigate the integrity of the scraped data and any correlations that may exist, using this information to fill missing data with groupby mean imputation and machine learning models.

- Step 4. Finally, we perform data analysis on the evolution of television metrics over the past 20 years.



Additional data is scraped from the main page of each title and its associated pages. It is worth noting that IMDb limits the start number of the search results page to 9750; after that, it replaces the starting number with a random code to deter scraping beyond 10K titles. One could, however, nest a year iterable inside each search loop and scrape each year of an advanced search separately, since a single year generally has fewer than 10K titles per genre.
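
The year-by-year workaround can be pictured with the sketch below. It only builds search URLs and fetches nothing. The helper name `build_search_urls` and its parameters are hypothetical, and the query string is a trimmed version of the one used later in this notebook, so treat it as an assumption about how a per-year search might be parameterised.

```python
def build_search_urls(years, max_start=9750, per_page=250):
    """Return one advanced-search URL per (year, page start) pair.

    Hypothetical helper: restricting release_date to a single year is meant
    to keep each individual search under the ~10K result cap.
    """
    base = (
        "https://www.imdb.com/search/title/?title_type=tv_series"
        "&release_date={year}-01-01,{year}-12-31"
        "&count={count}&start={start}"
    )
    urls = []
    for year in years:
        # Page starts run 1, 251, 501, ... and never exceed max_start.
        for start in range(1, max_start + 1, per_page):
            urls.append(base.format(year=year, count=per_page, start=start))
    return urls


# Three years, three pages each (starts 1, 251 and 501).
urls = build_search_urls(range(2000, 2003), max_start=501)
print(len(urls))
print(urls[0])
```

In practice the inner loop would stop as soon as a page returns no listings, rather than always walking up to `max_start`.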



In [None]:
# Load libraries 
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from urllib.request import urlopen
import numpy as np
import re


In [None]:
# Create an array of start numbers in increments of 250 (250 results per page), covering results 1 through 3500
nums = np.arange(1, 3500, 250)
nums = nums.astype(str)


# Looping Through The Advanced Search Results
- We will get surface information here
- We will use the IMDb ID from this scrape to dig a little deeper

In [None]:
# Instantiate the lists to capture the div, span, and classes
IMDB = []
TITLE = []
CERT = []
YEARS = []
RATING = []
GENRE = []
VOTES = []
DESC = []
STARS = []
RUNTIME = []


# Looping through the numbers of the array which is concatenated to the link
for num in nums:
    url = "https://www.imdb.com/search/title/?title_type=tv_series&release_date=2000-01-01,2021-12-31&num_votes=500,&countries=us&languages=en&sort=release_date,&count=250&start="+num+"&ref_=adv_nxt&ref_=adv_nxt"
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    movies = soup.find_all('div', {'class':'lister-item mode-advanced'})

    # Loop through the lister-item entries found on each results page of the num loop
    for i in movies:
        try:
            IMDB.append(i.find('img', {'class':'loadlate'})['data-tconst'][2:])
        except:
            IMDB.append(np.nan)
        try:
            TITLE.append(i.h3.a.text)
        except:
            TITLE.append(np.nan)
        try:
            RUNTIME.append(i.find("span",{'class':'runtime'}).text[:-3])
        except:
            # Add np.nan for later conversion to int and fill with mean/median
            RUNTIME.append(np.nan)
        try:
            # Strip the surrounding parentheses; release/end years are split later
            year = i.find("span", {"class": "lister-item-year text-muted unbold"}).text[1:-1]
            YEARS.append(year)
        except:
            YEARS.append(np.nan)
        try:
            GENRE.append(i.find("span", {"class": "genre"}).text[1:])
        except:
            GENRE.append(np.nan)
        try:
            RATING.append(i.find("div", {"class": "inline-block ratings-imdb-rating"})['data-value'])
        except:
            RATING.append(np.nan)
        try:
            # Remove the comma for easier dtype conversion later
            VOTES.append(i.find("span", {"name": "nv"}).text.replace(",",""))
        except:
            VOTES.append(np.nan)
        try:
            DESC.append(i.find_all('p', {'class':'text-muted'})[1].text[1:])
        except:
            DESC.append(np.nan)
        try:
            star = i.find_all("a", href=True)[-4:]
            STARS.append(star[0].text+", "+star[1].text+", "+star[2].text+", "+star[3].text)
        except:
            STARS.append(np.nan)
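
The repeated `try`/`except` blocks above all follow the same pattern: attempt a lookup and fall back to `np.nan`. A minimal sketch of how that pattern could be factored out is shown below. The helper `safe_extract` is hypothetical, and a plain dictionary stands in for a parsed listing so the example runs without fetching a page.

```python
import math


def safe_extract(getter, default=math.nan):
    """Call getter() and return its result, or default if the lookup fails."""
    try:
        return getter()
    except (AttributeError, TypeError, KeyError, IndexError):
        # These are the errors raised when a tag, attribute or index is missing.
        return default


# Stand-in for one parsed listing; a real item would be a BeautifulSoup tag.
item = {"title": "Malcolm in the Middle", "votes": "124,307"}

title = safe_extract(lambda: item["title"])
votes = safe_extract(lambda: item["votes"].replace(",", ""))
runtime = safe_extract(lambda: item["runtime"][:-3])  # key is missing

print(title, votes, runtime)
```

Catching only the lookup-related exceptions, rather than using a bare `except`, keeps unrelated problems such as a keyboard interrupt from being silently swallowed.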


In [None]:
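# CERT is never filled in the loop above, so it is left out here;
# including an empty list would raise a length-mismatch error.
df = pd.DataFrame({'imdb':IMDB,
                   'title':TITLE,
                   'years': YEARS,
                   'runtime': RUNTIME,
                   'genre': GENRE,
                   'rating': RATING,
                   'vote':VOTES,
                   'description':DESC,
                   'stars':STARS,
                   })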

df.to_csv("/content/drive/MyDrive/1. Full Projects/IMDb Scrape and EDA/datasets/tv_df_00_21.csv", index=False)

In [None]:
# Inspect the shape and first rows of the dataframe built from the results
print(df.shape)
df.head()

(3445, 9)


Unnamed: 0,imdb,name,runtime_(mins),years,genre,rating,votes,description,stars
0,273855,My Wife and Kids,30,2001–2005,"Comedy, Family",6.9,26672,"Michael Kyle longs for a traditional life, but...","Damon Wayans, Tisha Campbell, George Gore II, ..."
1,212671,Malcolm in the Middle,22,2000–2006,"Comedy, Family",8.0,124307,A gifted young teen tries to survive life with...,"Frankie Muniz, Bryan Cranston, Justin Berfield..."
2,218787,Ripley's Believe It or Not!,42,2000–2003,Documentary,6.6,1502,Ripley's Believe It or Not! is a curious forma...,"Dean Cain, Kelly Packard, Daniel Browning Smit..."
3,199421,Higher Ground,60,2000,Drama,8.2,1448,Former Wall Street mogul Peter Scarbrow become...,"Joe Lando, Hayden Christensen, A.J. Cook, Kand..."
4,206476,Cleopatra 2525,60,2000–2001,"Sci-Fi, Action, Fantasy",5.9,2082,"An exotic dancer, cryogenically frozen in the ...","Gina Torres, Victoria Pratt, Jennifer Sky, Eli..."
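
The `years` column in the preview holds strings such as `2001–2005` or a single year. The sketch below shows one way such a string could be split into a release year and an end year. The function name `split_years` is hypothetical, and two choices in it are assumptions rather than anything this notebook confirms: a lone year is treated as a series that started and ended in that year, and a trailing dash is treated as a series that is still running.

```python
import re


def split_years(years):
    """Split an IMDb year string into (release_year, end_year).

    end_year is None when the range is open-ended, e.g. "2019– ".
    """
    found = [int(y) for y in re.findall(r"\d{4}", years)]
    if not found:
        return None, None
    if len(found) >= 2:
        return found[0], found[1]
    # One year only: a trailing dash marks an open-ended range (assumption).
    is_open = bool(re.search(r"[–-]\s*$", years))
    return found[0], (None if is_open else found[0])


for text in ["2001–2005", "2000", "2019– "]:
    print(repr(text), split_years(text))
```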


In [None]:
df = pd.read_csv("/content/drive/MyDrive/1. Full Projects/IMDb Scrape and EDA/datasets/tv_df_00_21.csv", dtype={'imdb':'object'} )
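
Reading `imdb` as `object` matters because the IDs are concatenated into URLs in Part 2. The preview above shows six-digit IDs such as 273855, whereas IMDb title IDs are written with at least seven digits after `tt`. The sketch below, built around the hypothetical helper `to_title_url`, assumes that a shorter ID has lost its leading zeros and pads it back before building the URL.

```python
def to_title_url(imdb_id, page="reference"):
    """Build a title URL from a numeric IMDb id, restoring leading zeros."""
    digits = str(imdb_id).strip()
    if not digits.isdigit():
        raise ValueError(f"not a numeric IMDb id: {imdb_id!r}")
    # zfill only pads; ids that already have 7 or more digits are unchanged.
    return f"https://www.imdb.com/title/tt{digits.zfill(7)}/{page}"


print(to_title_url(273855))
print(to_title_url("0212671", page="keywords"))
```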

# Part 2
* company credits page
* Missing titles end/most recent season years
* No. of episodes
* No. of seasons
* Distribution company
* Full list of genres (not limited to 3)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3445 entries, 0 to 3444
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   imdb            3445 non-null   object 
 1   name            3445 non-null   object 
 2   runtime_(mins)  2984 non-null   object 
 3   years           3445 non-null   object 
 4   genre           3445 non-null   object 
 5   rating          3445 non-null   float64
 6   votes           3445 non-null   int64  
 7   description     3445 non-null   object 
 8   stars           3445 non-null   object 
dtypes: float64(1), int64(1), object(7)
memory usage: 242.4+ KB


In [None]:
imdb_tvgenres = \
"""Action
Adventure
Animation
Biography
Comedy
Crime
Documentary
Drama
Family
Fantasy
Game-Show
History
Horror
Music
Musical
Mystery
News
Reality-TV
Romance
Sci-Fi
Sport
Short
Talk-Show
Thriller
War
Western"""

In [None]:
# Array of unique IMDb ids used for the per-title scrapes below
imdbid = df['imdb'].unique()
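
The loop in the next cell relies on a handful of regular expressions to pull season counts and runtimes out of page text. The sketch below applies the same season and runtime patterns to invented sample strings, so the inputs are assumptions about the page text, not copies of it. The helper `first_int` is hypothetical.

```python
import re

# Same patterns as in the scraping cell, written as raw strings.
season_pattern = r"Seasons:\s+\d+|Season:\s+\d+"
time_pattern = r"\d+min"


def first_int(pattern, text):
    """Return the first integer inside the first match of pattern, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return int(re.search(r"\d+", match.group(0)).group(0))


# Invented sample text, shaped like the stored values such as 'Seasons: \n 5'.
seasons = first_int(season_pattern, "Seasons: \n 5")
runtime = first_int(time_pattern, "TV-PG | 30min | Comedy")
missing = first_int(season_pattern, "no season information")

print(seasons, runtime, missing)
```

Reducing each match to an integer at scrape time would avoid storing list-like strings such as `['Seasons: \n 5']` that need cleaning later.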

In [None]:
genre_split = imdb_tvgenres.split('\n')
# Raw strings keep backslashes such as \s and \d intact for the regex engine
summary_pattern = r'(?s)(?<=plot summary).*?(?=plot summary)'
season_pattern = r"Seasons:\s+\d+|Season:\s+\d+"
wins = r"\d+ wins"
wons = r"won \d+"
time_pattern = r'\d+min'


GENRES = {}
SUMMARY= {}
SEASONS = {}
certs = {}  # named to match its use in the loop and in the mapping cell
DIST = {}
AWARDS = {}
RUNTIME = {}
KWS = {}

i = len(df)
for imdb in imdbid:
  url = "https://www.imdb.com/title/tt"+imdb+"/reference"
  html = urlopen(url)
  soup = BeautifulSoup(html, "html.parser")

  genre_container = soup.find_all("ul", {"class":"ipl-inline-list"})
  distributors = soup.find_all("ul", {"class":"simpleList"})
  plot_container = soup.find_all("table", {"class":"titlereference-list ipl-zebra-list"})
  certificates = soup.find_all("ul", {"class":"ipl-inline-list"})
  seasons = soup.find_all("section", {"class":"titlereference-section-overview"})
  awards = soup.find_all("div", {"class":"titlereference-overview-section"})
  runtime = soup.find_all('ul', {'class':'ipl-inline-list'})

  RUNTIME[imdb] = re.findall(time_pattern, runtime[0].text)
  SEASONS[imdb] = re.findall(season_pattern, seasons[0].text.strip())
  DIST[imdb] = distributors[1].text.strip().split('\n')[0]

  # Extract the United States certificate; similar certificates can be merged later
  for cert in certificates:
    try:
      x = ''.join(re.findall(r"United States:[\w+\d-]+", cert.text))
      y = x.replace("United States:", "")
      if y != "":
        certs[imdb] = y[:5]
    except:
      certs[imdb] = 'None'


  genre_str = ""
  for x in genre_container[3:]:
    genre_str += x.text
  parsed_genres = re.findall('|'.join(genre_split), genre_str)
  GENRES[imdb] = parsed_genres

  plot_str = ""
  for x in plot_container[:1]:
    plot_str += str(x.text.lower())
  parsed_summary = re.findall(summary_pattern, plot_str)
  SUMMARY[imdb] = parsed_summary


  award_string = ""
  for x in awards:
    award_string += str(x.text.strip().lower())
    try:
      win_cap = re.search(wins, award_string).group(0).split(" ")[0]
      won_cap = re.search(wons, award_string).group(0).split(" ")[1]
      total = int(win_cap) + int(won_cap)
      AWARDS[imdb] = total
    except:
      AWARDS[imdb] = 0

      
  kw_url = "https://www.imdb.com/title/tt"+imdb+"/keywords"
  kw_soup = urlopen(kw_url)
  kw_soup = BeautifulSoup(kw_soup, "html.parser")
  kw_container = kw_soup.find_all("div", {"class":"sodatext"})
  lst = []
  for x in kw_container:
    lst.extend(x.text.strip().split('\n'))
  KWS[imdb] = lst

  print(i)
  i-=1 
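
The loop above makes two requests per title, so a single network error would stop the whole run. Below is a minimal sketch of a retrying fetch, assuming that a short pause and a few retries are acceptable. The function `fetch_html`, its parameters, and the stand-in opener are all hypothetical; the demonstration runs offline and contacts no site.

```python
import io
import time
from urllib.error import URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen, retries=3, delay=1.0):
    """Return the response body for url, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except URLError:
            # HTTPError is a subclass of URLError, so both are retried here.
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer each time
    return None


# Offline demonstration: a stand-in opener that fails twice, then succeeds.
attempts = []


def flaky_opener(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise URLError("temporary failure")
    return io.BytesIO(b"<html></html>")


page = fetch_html("https://example.invalid/title", opener=flaky_opener, delay=0)
print(page, len(attempts))
```

The returned bytes can be passed to `BeautifulSoup(page, "html.parser")` in place of the `urlopen` response, and a `None` result can be used to skip a title instead of aborting the run.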

In [None]:
# Map each scraped dictionary onto the dataframe by IMDb id
df['seasons'] = df['imdb'].map(SEASONS)
df['distributors'] = df['imdb'].map(DIST)
df['certificates'] = df['imdb'].map(certs)
df['full_genres'] = df['imdb'].map(GENRES)
df['plot_summary'] = df['imdb'].map(SUMMARY)
df['keywords'] = df['imdb'].map(KWS)
df['awards'] = df['imdb'].map(AWARDS)

In [None]:
df.head()

Unnamed: 0,imdb,name,runtime_(mins),years,rating,votes,description,stars,seasons,distributors,plot_summary,keywords,awards,release_year,end_year,season_numeric,certs_merged,certs_numeric,runtime_numeric,platform,platform_numeric,clean_genres,action,adventure,animation,biography,comedy,crime,documentary,drama,family,fantasy,game-show,history,horror,music,mystery,news,reality-tv,romance,sci-fi,sport,short,talk-show,thriller,war,western
0,273855,My Wife and Kids,30,2001–2005,6.9,26672,"Michael Kyle longs for a traditional life, but...","Damon Wayans, Tisha Campbell, George Gore II, ...",['Seasons: \n 5'],Buena Vista Television,damon wayans plays michael kyle a man on a tr...,"laughter, 2000s, dark comedy, adolescent, ki...",0.0,2001,2005,5,tv-pg,1,30.0,basic_cable,4,"Comedy,Family",0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,212671,Malcolm in the Middle,22,2000–2006,8.0,124307,A gifted young teen tries to survive life with...,"Frankie Muniz, Bryan Cranston, Justin Berfield...",['Seasons: \n 7'],Fox Network,an offbeat laugh track lacking sitcom about a...,"father son relationship, exploitation of fri...",46.0,2000,2006,7,tv-pg,1,22.0,network,1,"Comedy,Family",0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,218787,Ripley's Believe It or Not!,42,2000–2003,6.6,1502,Ripley's Believe It or Not! is a curious forma...,"Dean Cain, Kelly Packard, Daniel Browning Smit...",['Seasons: \n 4'],TBS Superstation,ripley s believe it or not is a curious forma...,catching a bullet in one s teeth,0.0,2000,2003,4,unk,4,42.0,basic_cable,4,Documentary,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,199421,Higher Ground,60,2000,8.2,1448,Former Wall Street mogul Peter Scarbrow become...,"Joe Lando, Hayden Christensen, A.J. Cook, Kand...",['Season: \n 1'],Fox Family Channel,located high in the mountains of the northwest...,"teenage boy, drug, troubled teenager, high s...",0.0,2000,2025,1,tv-pg,1,60.0,network,1,Drama,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,206476,Cleopatra 2525,60,2000–2001,5.9,2082,"An exotic dancer, cryogenically frozen in the ...","Gina Torres, Victoria Pratt, Jennifer Sky, Eli...",['Seasons: \n 2'],M.I.B.,an exotic dancer cryogenically frozen in the ...,"disembodied voice, psychotronic series, camp...",0.0,2000,2001,2,tv-14,2,60.0,basic_cable,4,"Sci-Fi,Action,Fantasy",1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
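
The preview above includes one indicator column per genre, which the code in this notebook does not build. A minimal sketch of how such columns could be derived from a `clean_genres` string is given below. The function `one_hot_genres` is hypothetical, and the short vocabulary is a subset chosen for illustration.

```python
def one_hot_genres(clean_genres, vocabulary):
    """Map a comma-separated genre string to {genre_column: 0 or 1}."""
    present = {g.strip().lower() for g in clean_genres.split(",") if g.strip()}
    # One lower-cased column per vocabulary entry, in vocabulary order.
    return {g.lower(): int(g.lower() in present) for g in vocabulary}


vocabulary = ["Action", "Comedy", "Family", "Fantasy", "Sci-Fi"]
row = one_hot_genres("Sci-Fi,Action,Fantasy", vocabulary)
print(row)
```

With pandas, `Series.str.get_dummies(sep=',')` offers a similar result for a whole column at once.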


In [None]:
df.to_csv("/content/drive/MyDrive/1. Full Projects/IMDb Scrape and EDA/datasets/tv_df_raw.csv", index=False)