# TV Shows Rating Analysis using Python

This project analyzes the ratings that viewers give TV shows on IMDb, which can help in deciding what to watch. We use web scraping in Python to extract the IMDb top-rated TV shows and their ratings. The information is pulled from the website with two Python libraries commonly used for web scraping: requests and Beautiful Soup.
The steps to extract the information:
1. Load the libraries needed for the project.
2. Download the page using requests.
3. Parse the HTML source code using BeautifulSoup.
4. Cast the extracted list to a DataFrame.
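
The parsing and casting steps can be tried offline before touching the live site. The sketch below is a minimal illustration, not the notebook's own code: `SAMPLE_HTML` is hand-written, hypothetical markup with made-up show names, and the real page is larger and may be structured differently.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, hand-written markup that mimics a small table of ranked shows.
SAMPLE_HTML = """
<table>
  <tr><td class="titleColumn">1. <a title="Cast A, Cast B">Show One</a> <span>(2016)</span></td></tr>
  <tr><td class="titleColumn">2. <a title="Cast C, Cast D">Show Two</a> <span>(2008)</span></td></tr>
</table>
"""

# Parse the HTML source
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect one dictionary per table cell
rows = []
for cell in soup.select("td.titleColumn"):
    link = cell.select_one("a")
    rows.append({
        "title": link.get_text(strip=True),
        "cast": link.attrs.get("title"),
        "year": cell.select_one("span").get_text(strip=True).strip("()"),
    })

# Cast the list of dictionaries to a DataFrame
sample_df = pd.DataFrame(rows)
print(sample_df)
```

Working on a fixed fragment like this makes it easier to check the selectors without depending on the network.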

In [2]:
#import the required libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots

### Parse the html source code using BeautifulSoup

In [3]:
# download the IMDb top-rated TV shows page and parse its HTML
url = 'https://www.imdb.com/chart/toptv/?ref_=nv_tvv_250'
data = requests.get(url).text
soup = BeautifulSoup(data, 'html.parser')
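
The download above does not check whether the request succeeded, and `requests.get` waits indefinitely unless a timeout is passed. A hedged sketch of a more defensive download follows; `fetch_html` is a hypothetical helper, and the `User-Agent` value is an assumption (some sites may reject requests that lack a browser-like header).

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising if the request fails."""
    session = session or requests.Session()
    # Assumed header value; any browser-like User-Agent string could be used.
    headers = {"User-Agent": "Mozilla/5.0 (tv-ratings-notebook)"}
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

It could be used as `data = fetch_html(url)` in place of `requests.get(url).text`. Accepting an optional session also makes the helper easy to exercise with a stand-in object instead of the network.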

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <style>
   body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
  </style>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <title>
   IMDb Top 250 TV - IMDb
  </title>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
  </script>
  <link href="http

In [20]:
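# extract the title cells, the cast (from each link's title attribute),
# and the ratings (from the data-value attribute of each rating span)
tv_shows = soup.select('td.titleColumn')
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]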
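
Each element of `tv_shows` is a table cell whose text combines the rank, the title, and the year. A minimal sketch of splitting such text with a regular expression is shown here, assuming the text has the form `rank. title (year)` once whitespace is collapsed; `parse_title_cell` is a hypothetical helper and not part of the notebook.

```python
import re

# Assumed layout after collapsing whitespace: "<rank>. <title> (<4-digit year>)"
CELL_PATTERN = re.compile(r"^(\d+)\.\s+(.*)\s+\((\d{4})\)$")


def parse_title_cell(text):
    """Split the text of a title cell into rank, title, and year."""
    collapsed = " ".join(text.split())  # collapse newlines and repeated spaces
    match = CELL_PATTERN.match(collapsed)
    if match is None:
        raise ValueError(f"unexpected cell text: {collapsed!r}")
    rank, title, year = match.groups()
    return {"place": int(rank), "tv_show_title": title, "year": int(year)}


print(parse_title_cell("\n  10.\n  The Sopranos\n  (1999)\n"))
```

Matching the whole string at once keeps periods that belong to a title and fails loudly when a cell does not follow the assumed layout.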


### Extract the information and data about TV shows

* place (rank in the chart)
* title
* year
* rating
* star cast

In [9]:
# after extracting the TV show details, store each show's details in a
# dictionary and append it to a list
tv_show_list = []
for i in range(len(tv_shows)):
    # collapse whitespace, e.g. "1. Planet Earth II (2016)"
    tv_show_string = ' '.join(tv_shows[i].get_text().split())
    # split off the rank at the first ". " so periods inside titles are kept
    place, tv_show = tv_show_string.split('. ', 1)
    # drop the trailing " (YYYY)" to leave the title
    tv_show_title = tv_show[:-7]
    year = re.search(r'\((\d{4})\)$', tv_show).group(1)
    tv_show_data = {"place": place,
                    "tv_show_title": tv_show_title,
                    "year": year,
                    "rating": float(ratings[i]),
                    "star_cast": crew[i]}
    tv_show_list.append(tv_show_data)

In [10]:
# the list is now filled with the top IMDb TV shows and their details
for tv_show in tv_show_list:
    print(tv_show['place'], '-', tv_show['tv_show_title'], '(' + tv_show['year'] + ')-', 'starring:', tv_show['star_cast'], tv_show['rating'])

1 - Planet Earth II (2016)- starring: David Attenborough, Gordon Buchanan 9.435217337859372
2 - Breaking Bad (2008)- starring: Bryan Cranston, Aaron Paul 9.428723626985452
3 - Planet Earth (2006)- starring: Sigourney Weaver, David Attenborough 9.413601250056269
4 - Band of Brothers (2001)- starring: Scott Grimes, Damian Lewis 9.389987975465136
5 - Chernobyl (2019)- starring: Jessie Buckley, Jared Harris 9.31633076233821
6 - The Wire (2002)- starring: Dominic West, Lance Reddick 9.289576281441015
7 - Blue Planet II (2017)- starring: David Attenborough, Peter Drost 9.230684164505488
8 - Avatar: The Last Airbender (2005)- starring: Dee Bradley Baker, Zach Tyler Eisen 9.230607870044567
9 - Cosmos: A Spacetime Odyssey (2014)- starring: Neil deGrasse Tyson, Christopher Emerson 9.202284833868452
10 - The Sopranos (1999)- starring: James Gandolfini, Lorraine Bracco 9.201057107655533
11 - Cosmos (1980)- starring: Carl Sagan, Jaromír Hanzlík 9.187613045278635
12 - Our Planet (2019)- starring: Davi

In [11]:
tv_show_list

[{'place': '1',
  'tv_show_title': 'Planet Earth II',
  'year': '2016',
  'rating': 9.435217337859372,
  'star_cast': 'David Attenborough, Gordon Buchanan'},
 {'place': '2',
  'tv_show_title': 'Breaking Bad',
  'year': '2008',
  'rating': 9.428723626985452,
  'star_cast': 'Bryan Cranston, Aaron Paul'},
 {'place': '3',
  'tv_show_title': 'Planet Earth',
  'year': '2006',
  'rating': 9.413601250056269,
  'star_cast': 'Sigourney Weaver, David Attenborough'},
 {'place': '4',
  'tv_show_title': 'Band of Brothers',
  'year': '2001',
  'rating': 9.389987975465136,
  'star_cast': 'Scott Grimes, Damian Lewis'},
 {'place': '5',
  'tv_show_title': 'Chernobyl',
  'year': '2019',
  'rating': 9.31633076233821,
  'star_cast': 'Jessie Buckley, Jared Harris'},
 {'place': '6',
  'tv_show_title': 'The Wire',
  'year': '2002',
  'rating': 9.289576281441015,
  'star_cast': 'Dominic West, Lance Reddick'},
 {'place': '7',
  'tv_show_title': 'Blue Planet II',
  'year': '2017',
  'rating': 9.230684164505488,

## Data Preparation and Cleaning

In [12]:
# cast the list of dictionaries to a DataFrame
df = pd.DataFrame(tv_show_list)
# show the first 5 rows of the DataFrame
df.head(5)

|   | place | tv_show_title | year | rating | star_cast |
|---|---|---|---|---|---|
| 0 | 1 | Planet Earth II | 2016 | 9.435217 | David Attenborough, Gordon Buchanan |
| 1 | 2 | Breaking Bad | 2008 | 9.428724 | Bryan Cranston, Aaron Paul |
| 2 | 3 | Planet Earth | 2006 | 9.413601 | Sigourney Weaver, David Attenborough |
| 3 | 4 | Band of Brothers | 2001 | 9.389988 | Scott Grimes, Damian Lewis |
| 4 | 5 | Chernobyl | 2019 | 9.316331 | Jessie Buckley, Jared Harris |


In [13]:
# show the column data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   place          250 non-null    object 
 1   tv_show_title  250 non-null    object 
 2   year           250 non-null    object 
 3   rating         250 non-null    float64
 4   star_cast      250 non-null    object 
dtypes: float64(1), object(4)
memory usage: 9.9+ KB
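
The summary above reports `place` and `year` as `object`, which means they hold strings rather than numbers. Sorting or filtering by year is easier on numeric columns, so one possible cleaning step is sketched below. The `sample` frame is a hypothetical stand-in with made-up values, not the scraped data.

```python
import pandas as pd

# Hypothetical stand-in rows with string-typed place and year columns
sample = pd.DataFrame({
    "place": ["1", "2"],
    "year": ["2016", "2008"],
    "rating": [9.4, 9.3],
})

# pd.to_numeric with errors="coerce" turns unparseable strings into NaN
for column in ["place", "year"]:
    sample[column] = pd.to_numeric(sample[column], errors="coerce")

print(sample.dtypes)
```

Using `errors="coerce"` is a design choice: bad values show up as missing data in a later `isnull()` check instead of stopping the conversion with an exception.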


In [14]:
df.isnull().sum()

place            0
tv_show_title    0
year             0
rating           0
star_cast        0
dtype: int64

In [19]:
# save the data to a CSV file
df.to_csv('imdb_top_250_tv_shows.csv',index=False)
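
The `index=False` argument matters when the file is read back later. The sketch below writes a small hypothetical frame both ways into a temporary directory and compares the columns that `pd.read_csv` returns; the file names and values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({"tv_show_title": ["Show One", "Show Two"],
                       "rating": [9.4, 9.3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical file names inside a throwaway directory
    without_index = os.path.join(tmp_dir, "without_index.csv")
    with_index = os.path.join(tmp_dir, "with_index.csv")

    sample.to_csv(without_index, index=False)  # data columns only
    sample.to_csv(with_index)                  # row index written as an unnamed first column

    restored = pd.read_csv(without_index)
    restored_with_index = pd.read_csv(with_index)

print(restored.columns.tolist())
print(restored_with_index.columns.tolist())
```

When the index is written, reading the file back yields an extra `Unnamed: 0` column, which is why saving with `index=False` keeps the round trip clean.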