## Scraping https://www.ratingraph.com/tv-shows/

Using Selenium to collect the following for each show:

- Show title
- Link stub
- Start Year
- End Year
- Genres
- Number of Seasons
- Number of Episodes
- Season one rating if possible

The shows are restricted to those with at least one season and a start year of 1995 or later.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
import time
import random

In [2]:
def grab_cols4_to8():
    '''
    function that grabs data from columns 4 through 8 on ratingraph.com
    ---
    input: no input
    ---
    returns: five lists- start_years, end_years, genres, seasons, episodes 
    '''
    
    start_years = []
    end_years = []
    genres = []
    seasons = []
    episodes = []

    for i in range(1, 251): #goes through the length of the page
        for j in range(4, 9): #goes through the columns
            temp = driver.find_element_by_xpath("/html/body/div/main/section/div[2]/table/tbody/tr["+ str(i) +"]/td["+ str(j) +"]")

            if j == 4:
                start_years.append(temp.text)
            elif j == 5:
                end_years.append(temp.text)
            elif j == 6:
                genres.append([temp.text])
            elif j == 7:
                seasons.append(temp.text)
            elif j == 8:
                episodes.append(temp.text)        
    
    return start_years, end_years, genres, seasons, episodes
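
`grab_cols4_to8` stores each genre cell as a one-element list holding the raw cell text. If individual genres are needed later, a minimal sketch like the one below could split that text. It assumes the cell text is a comma-separated string such as `"Drama, History, Thriller"`; `split_genres` is a hypothetical helper, not part of the original notebook.

```python
def split_genres(cell_text):
    """Split a comma-separated genre cell into a list of clean genre names."""
    # Assumption: genres are separated by commas in the scraped cell text.
    # Empty pieces (e.g. from an empty cell) are dropped.
    return [genre.strip() for genre in cell_text.split(",") if genre.strip()]


print(split_genres("Drama, History, Thriller"))
```

Inside the scraping loop this would replace `genres.append([temp.text])` with `genres.append(split_genres(temp.text))`.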

In [3]:
def grab_titles_and_links():
    '''
    function that grabs the titles of each show and the corresponding links
    ---
    input: no input
    ---
    returns: two lists- titles, and links
    '''    
    web_titles = driver.find_elements_by_xpath(".//a[@target = '_top']")
    titles = [show.text for show in web_titles]
    links = [href.get_attribute('href') for href in web_titles]
    
    return titles, links

In [4]:
def scroll_then_next():
    #scroll to the bottom of the page, then go to the next page
    # and give it time to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.find_elements_by_link_text("Next")[0].click()
    time.sleep(1.5)

### Now that I've set up my main functions, I'll go through the website and grab the data I want

I'll set the table to show 250 entries per page and scrape 4 pages, for a total of 1,000 shows.

In [5]:
# open up a window and go to the page I want to scrape

chromedriver = "/Applications/chromedriver" # path to the chromedriver executable

driver = webdriver.Chrome(chromedriver)
driver.get('https://www.ratingraph.com/tv-shows/')
time.sleep(1)#make sure page has loaded

In [6]:
#change the settings to show 250 entries per page

show_entries = driver.find_elements_by_xpath("/html/body/div/main/section/div[2]/div[3]/label/select/option[5]")
show_entries[0].click()
time.sleep(1) #let page reload

In [7]:
#change the settings to have the minimum start year 1995
search_box = driver.find_element_by_xpath("//input[@id='filter_start_min']")
search_box.send_keys("1995")
time.sleep(1)
driver.find_elements_by_link_text("Apply")[0].click()

In [8]:
#grab data on page 1 and unpack them into stored lists
p11, p12 = grab_titles_and_links()
p13, p14, p15, p16, p17 = grab_cols4_to8()
time.sleep(.5+2*random.random())

#go to the next page and give it time to load
scroll_then_next()
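
The pause `.5+2*random.random()` recurs in every scraping cell. One way to name that idea is a small helper, sketched here; `jittered_delay` and its parameters are hypothetical names introduced for illustration.

```python
import random


def jittered_delay(base=0.5, spread=2.0, rng=random.random):
    """Return a pause length in seconds: `base` plus a random share of `spread`.

    `rng` must return a float in [0, 1); it is a parameter only so the
    helper can be checked with a fixed value.
    """
    return base + spread * rng()


# With the defaults the pause falls between 0.5 and 2.5 seconds.
print(jittered_delay())
```

In the notebook the call would then read `time.sleep(jittered_delay())`.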

In [9]:
#grab data on page 2 and unpack them into stored lists
p21, p22 = grab_titles_and_links()
p23, p24, p25, p26, p27 = grab_cols4_to8()
time.sleep(.5+2*random.random())

scroll_then_next()

In [10]:
#grab data on page 3 and unpack them into stored lists
p31, p32 = grab_titles_and_links()
p33, p34, p35, p36, p37 = grab_cols4_to8()
time.sleep(.5+2*random.random())

scroll_then_next()

In [11]:
#grab data on page 4 and unpack them into stored lists
p41, p42 =  grab_titles_and_links()
p43, p44, p45, p46, p47 = grab_cols4_to8()
time.sleep(.5+2*random.random())

In [12]:
driver.quit()

In [13]:
#concatenate all of the data into full lists
titles = p11 + p21 + p31 + p41
links = p12 + p22 + p32 + p42
start_years = p13 + p23 + p33 + p43
end_years = p14 + p24 + p34 + p44
genres = p15 + p25 + p35 + p45
seasons = p16 + p26 + p36 + p46
episodes = p17 + p27 + p37 + p47
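
The per-page variables (`p11` through `p47`) and the concatenation above could also be expressed as one loop. The sketch below shows the idea with the browser calls passed in as plain functions, so it can be tried without Selenium; `collect_pages`, `grab_page`, and `go_next` are hypothetical names.

```python
def collect_pages(n_pages, grab_page, go_next):
    """Call grab_page() on each of n_pages pages and merge the columns.

    grab_page returns a tuple of lists, one list per column.
    go_next moves to the following page and is not called after the last page.
    """
    merged = None
    for page in range(n_pages):
        columns = grab_page()
        if merged is None:
            merged = [list(col) for col in columns]
        else:
            for total, col in zip(merged, columns):
                total.extend(col)
        if page < n_pages - 1:
            go_next()
    return merged if merged is not None else []


# Stand-in for the browser: each fake "page" yields (titles, seasons).
fake_pages = iter([(["A", "B"], ["1", "2"]), (["C"], ["3"])])
clicks = []
demo_titles, demo_seasons = collect_pages(
    2, lambda: next(fake_pages), lambda: clicks.append("next")
)
print(demo_titles, demo_seasons, len(clicks))
```

With the notebook's own functions, `grab_page` could be `lambda: grab_titles_and_links() + grab_cols4_to8()` (both return tuples, so `+` joins them into seven columns) and `go_next` could be `scroll_then_next`.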

In [14]:
len(titles)

1000

In [15]:
len(links)

1000

In [16]:
len(start_years)

1000

In [17]:
len(end_years)

1000

In [18]:
len(genres)

1000

In [19]:
len(seasons)

1000

In [20]:
len(episodes)

1000
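
The seven `len(...)` checks above can be folded into a single comparison. This is a minimal sketch with hypothetical helper names, using short stand-in lists in place of the scraped columns.

```python
def column_lengths(**columns):
    """Map each column name to the length of its list."""
    return {name: len(values) for name, values in columns.items()}


def all_same_length(**columns):
    """True when every column has the same number of entries."""
    return len(set(column_lengths(**columns).values())) <= 1


# Stand-in data; in the notebook these would be titles, links, and so on.
print(column_lengths(titles=["A", "B"], links=["x", "y"]))
print(all_same_length(titles=["A", "B"], links=["x", "y"]))
```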

In [21]:
links[:10]

['https://www.ratingraph.com/tv-shows/chernobyl-ratings-67251/',
 'https://www.ratingraph.com/tv-shows/game-of-thrones-ratings-26649/',
 'https://www.ratingraph.com/tv-shows/black-mirror-ratings-42161/',
 'https://www.ratingraph.com/tv-shows/sherlock-ratings-35707/',
 'https://www.ratingraph.com/tv-shows/breaking-bad-ratings-26165/',
 'https://www.ratingraph.com/tv-shows/the-mandalorian-ratings-69621/',
 'https://www.ratingraph.com/tv-shows/stranger-things-ratings-56080/',
 'https://www.ratingraph.com/tv-shows/the-witcher-ratings-58598/',
 'https://www.ratingraph.com/tv-shows/westworld-ratings-22144/',
 'https://www.ratingraph.com/tv-shows/dark-ratings-60934/']

#### Now try to get the season 1 rating by following each link

In [33]:
driver = webdriver.Chrome(chromedriver)
driver.get(links[0])

### The season 1 rating sits in an interactive graph, inside a tspan tag, so I'll grab the text of every tspan tag on each page

In [None]:
#grab everything in the tspan tag from each link, then put the text in a list
all_ratings = []

for link in links:
    
    driver.get(link)
    time.sleep(.5+2*random.random())
    
    temp = driver.find_elements_by_tag_name('tspan')
    time.sleep(.5+2*random.random())
    
    all_ratings.append([elt.text for elt in temp])
    

In [81]:
driver.quit()

**The season 1 rating is the third element of each mini list within the all_ratings list, so go through each mini list and grab the element at index 2.**

In [None]:
#some entries are empty, so put a placeholder in those
#(checking for more than 2 elements also guards against lists too short to index)
season_one_rating = [mini_list[2] if len(mini_list) > 2 else 'place_holder' for mini_list in all_ratings]
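
Taking a fixed index relies on every page laying out its tspan tags in the same order. An alternative sketch is to search each list for the first label that starts with `Season 1 (`. This assumes the rating graph's labels come before the votes graph's labels, as in the Game of Thrones sample in the scratch work; `find_season_one` is a hypothetical helper.

```python
def find_season_one(tspan_texts):
    """Return the first label starting with 'Season 1 (' or None if absent."""
    for text in tspan_texts:
        # The space after "1" keeps "Season 10 (...)" from matching.
        if text.startswith("Season 1 ("):
            return text
    return None


sample = ["Episodes", "Rating", "Season 1 (9.1)", "Season 2 (9.0)",
          "Seasons trendline", "Episodes", "Votes", "Season 1 (30,569.1)"]
print(find_season_one(sample))
print(find_season_one([]))
```

A `None` result then plays the role of `'place_holder'`, and it also flags pages where index 2 holds something other than a season label.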

In [105]:
#find the empty lists, this will be where our place_holders are, i.e. where len == 0
for idx, mini_list in enumerate(all_ratings):
    if len(mini_list) == 0:
        print(idx, mini_list, titles[idx], seasons[idx], start_years[idx], end_years[idx], links[idx])
        

33 [] Narcos 3 2015 2017 https://www.ratingraph.com/tv-shows/narcos-ratings-47174/
57 [] Prison Break 5 2005 2017 https://www.ratingraph.com/tv-shows/prison-break-ratings-21193/
126 [] Unbelievable 1 2019 2019 https://www.ratingraph.com/tv-shows/unbelievable-ratings-69016/
171 [] The Man in the High Castle 4 2015 2019 https://www.ratingraph.com/tv-shows/the-man-in-the-high-castle-ratings-38650/
185 [] Criminal Minds 15 2005 2020 https://www.ratingraph.com/tv-shows/criminal-minds-ratings-21069/
230 [] Devs 1 2020 2020 https://www.ratingraph.com/tv-shows/devs-ratings-77287/
406 [] The New Pope 1 2019 - https://www.ratingraph.com/tv-shows/the-new-pope-ratings-66534/
716 [] Emergence 1 2019 2020 https://www.ratingraph.com/tv-shows/emergence-ratings-72282/


In [106]:
#now check for any other weird values (everything should either be place holder or start with season 1)
season_one_rating

['Season 1 (9.6)',
 'Season 1 (9.1)',
 'Season 1 (8.1)',
 'Season 1 (8.8)',
 'Season 1 (8.8)',
 'Season 1 (8.6)',
 'Season 1 (8.9)',
 'Season 1 (8.5)',
 'Season 1 (9.0)',
 'Season 1 (8.8)',
 'Season 1 (9.3)',
 'Season 1 (8.7)',
 'Season 1 (8.6)',
 'Season 1 (7.4)',
 '15 000',
 'Season 1 (8.9)',
 'Season 1 (8.8)',
 'Season 1 (8.7)',
 'Season 1 (8.7)',
 'Season 1 (9.1)',
 'Season 1 (9.0)',
 'Season 1 (8.7)',
 'Season 1 (8.7)',
 'Season 1 (8.9)',
 'Season 1 (9.1)',
 'Season 1 (8.6)',
 'Season 1 (8.8)',
 'Season 1 (8.4)',
 'Season 1 (8.7)',
 'Season 1 (8.5)',
 'Season 1 (8.0)',
 'Season 1 (8.4)',
 'Season 1 (8.8)',
 'place_holder',
 'Season 1 (8.7)',
 'Season 1 (8.9)',
 'Season 1 (8.7)',
 'Season 1 (8.7)',
 'Season 1 (8.6)',
 'Season 1 (8.6)',
 'Season 1 (8.8)',
 'Season 1 (8.6)',
 'Season 1 (8.5)',
 'Season 1 (8.1)',
 'Season 1 (8.2)',
 'Season 1 (8.8)',
 'Season 1 (8.2)',
 'Season 1 (7.0)',
 'Season 1 (8.4)',
 'Season 1 (8.4)',
 'Season 1 (8.7)',
 'Season 1 (8.7)',
 'Season 1 (8.9)',
 'S

In [107]:
#just need to fill in the place_holders, and the one element that's 15,000
season_one_rating[14]

'15 000'

In [108]:
print(titles[14], seasons[14], links[14])

Mr. Robot 4 https://www.ratingraph.com/tv-shows/mr-robot-ratings-54290/


In [109]:
#fill in missing values
season_one_rating[14] = 'Season 1 (9.0)'
season_one_rating[33] = 'Season 1 (8.8)'
season_one_rating[57] = 'Season 1 (8.5)'
season_one_rating[126] = 'Season 1 (8.5)'
season_one_rating[171] = 'Season 1 (8.2)'
season_one_rating[185] = 'Season 1 (7.7)'
season_one_rating[230] = 'Season 1 (8.0)'
season_one_rating[406] = 'Season 1 (8.1)'
season_one_rating[716] = 'Season 1 (8.0)'
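
The ratings are still strings of the form `'Season 1 (9.6)'`. If a numeric column is wanted, the number can be pulled out with a regular expression, as in this sketch; `parse_season_one` is a hypothetical helper, and anything that does not match the expected pattern comes back as `None`.

```python
import re

# Matches labels such as "Season 1 (9.6)" and captures the number.
RATING_PATTERN = re.compile(r"^Season 1 \((\d+(?:\.\d+)?)\)$")


def parse_season_one(label):
    """Return the rating as a float, or None if the label has another shape."""
    match = RATING_PATTERN.match(label)
    return float(match.group(1)) if match else None


for label in ["Season 1 (9.6)", "place_holder", "15 000"]:
    print(label, "->", parse_season_one(label))
```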

In [None]:
#now put it all in a dataframe

In [111]:
tv_show_df = pd.DataFrame()
tv_show_df['Title'] = titles
tv_show_df['Start_Year'] = start_years
tv_show_df['End_Year'] = end_years
tv_show_df['Genres'] = genres
tv_show_df['Num_of_Seasons'] = seasons
tv_show_df['Num_of_Episodes'] = episodes
tv_show_df['Season_1_Rating'] = season_one_rating
tv_show_df['Links'] = links
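
Every scraped value comes from `temp.text`, so the year columns hold strings, and some `End_Year` entries are `'-'` rather than a year. A minimal sketch for converting them, assuming that any non-numeric entry should become a missing value; `to_year` is a hypothetical helper.

```python
def to_year(text):
    """Convert a scraped year string to int; non-numeric text becomes None."""
    text = text.strip()
    return int(text) if text.isdigit() else None


print([to_year(value) for value in ["2019", "-", " 2017 "]])
```

Applied to the lists before building the dataframe, this would look like `end_years = [to_year(y) for y in end_years]`.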

In [113]:
tv_show_df.head(10)

| | Title | Start_Year | End_Year | Genres | Num_of_Seasons | Num_of_Episodes | Season_1_Rating | Links |
|---|---|---|---|---|---|---|---|---|
| 0 | Chernobyl | 2019 | 2019 | [Drama, History, Thriller] | 1 | 5 | Season 1 (9.6) | https://www.ratingraph.com/tv-shows/chernobyl-... |
| 1 | Game of Thrones | 2011 | 2019 | [Action, Drama, Adventure] | 8 | 73 | Season 1 (9.1) | https://www.ratingraph.com/tv-shows/game-of-th... |
| 2 | Black Mirror | 2011 | - | [Drama, Sci-fi, Thriller] | 5 | 22 | Season 1 (8.1) | https://www.ratingraph.com/tv-shows/black-mirr... |
| 3 | Sherlock | 2010 | 2017 | [Drama, Crime, Mystery] | 4 | 12 | Season 1 (8.8) | https://www.ratingraph.com/tv-shows/sherlock-r... |
| 4 | Breaking Bad | 2008 | 2013 | [Drama, Crime, Thriller] | 5 | 62 | Season 1 (8.8) | https://www.ratingraph.com/tv-shows/breaking-b... |
| 5 | The Mandalorian | 2019 | - | [Action, Adventure, Sci-fi] | 1 | 8 | Season 1 (8.6) | https://www.ratingraph.com/tv-shows/the-mandal... |
| 6 | Stranger Things | 2016 | - | [Drama, Fantasy, Horror] | 3 | 25 | Season 1 (8.9) | https://www.ratingraph.com/tv-shows/stranger-t... |
| 7 | The Witcher | 2019 | - | [Action, Drama, Fantasy, Adventure] | 1 | 8 | Season 1 (8.5) | https://www.ratingraph.com/tv-shows/the-witche... |
| 8 | Westworld | 2016 | - | [Drama, Mystery, Sci-fi] | 3 | 28 | Season 1 (9.0) | https://www.ratingraph.com/tv-shows/westworld-... |
| 9 | Dark | 2017 | 2020 | [Drama, Crime, Mystery] | 3 | 26 | Season 1 (8.8) | https://www.ratingraph.com/tv-shows/dark-ratin... |


In [114]:
#pickle our data
tv_show_df.to_pickle('original_tv_show_df.pkl')

## Scratch work below

In [None]:
tspans = [x.text for x in driver.find_elements_by_tag_name('tspan')]

In [51]:
tspans

['Episodes',
 'Rating',
 'Season 1 (9.1)',
 'Season 2 (9.0)',
 'Season 3 (9.1)',
 'Season 4 (9.3)',
 'Season 5 (8.8)',
 'Season 6 (9.1)',
 'Season 7 (9.1)',
 'Season 8 (6.4)',
 'Seasons trendline',
 'Episodes',
 'Votes',
 'Season 1 (30,569.1)',
 'Season 2 (26,035.0)',
 'Season 3 (31,222.3)',
 'Season 4 (34,464.9)',
 'Season 5 (33,802.0)',
 'Season 6 (61,941.0)',
 'Season 7 (52,525.0)',
 'Season 8 (168,007.3)',
 'Seasons trendline',
 'Date',
 'Votes',
 'Rating',
 'Total votes',
 'Average rating',
 "May '20",
 "Jul '20",
 "Sep '20",
 '3 420k',
 '3 450k',
 '3 480k',
 '3 510k',
 '3 540k',
 '3 570k',
 '3 600k',
 '3 630k']

In [None]:
for link in links:
    tspans = [x.text for x in driver.find_elements_by_tag_name('tspan')]

In [61]:
driver.get(links[508])

In [62]:
four = driver.find_elements_by_tag_name('tspan')

In [63]:
four[2].text

'Season 1 (8.6)'

In [69]:
all_ratings = []
for link in links:
    driver.get(link)
    time.sleep(.5+2*random.random())
    temp = driver.find_elements_by_tag_name('tspan')
    time.sleep(.5+2*random.random())
    all_ratings.append([elt.text for elt in temp])
    

In [70]:
len(all_ratings)

1000

In [77]:
for x in all_ratings[:10]:
    print(x[2])

Season 1 (9.6)
Season 1 (9.1)
Season 1 (8.1)
Season 1 (8.8)
Season 1 (8.8)
Season 1 (8.6)
Season 1 (8.9)
Season 1 (8.5)
Season 1 (9.0)
Season 1 (8.8)


In [91]:
season_one_rating = []
for x in all_ratings:
    print(len(x))
    #season_one_rating.append(x[2])
#print(season_one_rating)

18
42
28
26
36
18
24
18
24
24
24
26
46
18
26
28
24
19
18
18
26
28
22
24
18
24
22
26
30
24
42
22
26
0
28
34
18
26
30
30
26
26
30
30
24
28
26
26
34
26
30
30
26
24
32
48
28
0
22
23
32
22
36
36
36
30
26
24
18
18
30
22
24
22
18
28
26
26
22
24
22
26
22
30
22
32
26
26
28
32
24
26
22
34
30
22
26
34
18
24
32
42
30
24
30
36
30
26
32
32
46
26
18
22
30
26
64
32
26
30
30
28
22
32
36
26
0
30
26
18
30
30
28
26
30
26
34
34
22
36
32
26
30
24
36
32
26
26
34
24
25
28
22
30
22
28
22
26
32
26
18
22
28
22
38
26
32
26
32
26
19
0
18
22
32
30
18
24
27
32
30
30
28
28
21
0
32
26
30
34
26
22
32
18
36
26
32
26
30
24
26
20
32
32
18
18
22
32
38
36
30
32
30
40
26
18
18
24
22
28
38
28
22
32
30
32
18
30
34
24
0
30
26
26
24
18
18
18
32
26
18
30
32
30
30
24
22
24
34
34
26
32
22
34
26
30
23
18
30
24
36
40
18
22
18
30
38
18
38
38
18
46
56
18
18
28
32
18
32
50
22
18
22
32
20
34
30
36
32
40
21
40
30
28
34
32
23
28
22
38
22
18
32
40
18
38
30
34
30
26
30
22
18
40
32
18
42
38
30
30
26
32
24
18
18
18
18
18
40
26
28
32
29
18
36
2

