# Lecture 6: How to get the data
aka "Why BeautifulSoup is cool"

(This lecture is inspired by the webpage [here](https://www.dataquest.io/blog/web-scraping-beautifulsoup/).)

## Loading python modules

[requests](http://docs.python-requests.org/en/master/) downloads a webpage.
[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) parses it and lets us navigate it.
<br>**PAY ATTENTION!** The module is called request**S** (`request` is a different module).

In [140]:
from requests import get
from bs4 import BeautifulSoup
import bs4
import numpy as np  # needed later for the structured arrays

# IMDB

What is [IMDB](https://en.wikipedia.org/wiki/IMDb)?

The webpage of [IMDB search](https://www.imdb.com/search/title)

## requests in action

In [6]:
url = 'http://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'

In [7]:
response = get(url,headers = {"Accept-Language": "en-US, en;q=0.5"})

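With this header we tell the server that we prefer US English (`en-US`) and, failing that, accept any kind of English (`en`) with a lower priority: `q=0.5` is a relative weight between 0 and 1, and a language without an explicit `q` counts as 1.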
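To make the role of `q` concrete, here is a minimal sketch that splits an `Accept-Language` value into (language, weight) pairs. The helper `parse_accept_language` is a hypothetical name introduced for illustration (it is not part of requests), and it only handles the simple form used above.

```python
def parse_accept_language(header):
    """Split an Accept-Language value into (language, weight) pairs."""
    pairs = []
    for item in header.split(','):
        parts = item.strip().split(';')
        language = parts[0].strip()
        weight = 1.0  # a language without an explicit q counts as 1
        for param in parts[1:]:
            key, _, value = param.strip().partition('=')
            if key == 'q':
                weight = float(value)
        pairs.append((language, weight))
    # most preferred language first
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


preferences = parse_accept_language("en-US, en;q=0.5")
print(preferences)
```

The server is free to ignore the header, so it is still worth checking the language of what comes back.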

In [5]:
print response.text[:500]



<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle",


Cool, but with plain text we cannot do anything really interesting...

## Beautiful Soup is indeed beautiful
The nice thing about BeautifulSoup is that it lets you navigate the structure of the webpage.

### Getting a navigable webpage

In [8]:
html_soup = BeautifulSoup(response.text, 'html.parser')

Here we are using `html.parser`, the parser that ships with Python's standard library, but in principle there are [others](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser).

In [11]:
html_soup



The parsed page exposes its tags as attributes. Let us play a little with some of them.

In [18]:
html_soup.head

<head>\n<meta charset="unicode-escape"/>\n<meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>\n<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>\n<script>\n    if (typeof uet == 'function') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<title>IMDb: Released between 2018-01-01 and 2018-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>\n<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == 'function') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex == 'function') {\n      uex("ld", "LoadTitle", {wb: 1});\n    }\n</script>\n<link href="https://www.imdb.com/search/

In [16]:
testa=html_soup.head

In [23]:
testa.contents

[u'\n',
 <meta charset="unicode-escape"/>,
 u'\n',
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>,
 u'\n',
 <meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>,
 u'\n',
 <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>,
 u'\n',
 <script>\n    if (typeof uet == 'function') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>,
 u'\n',
 <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>,
 u'\n',
 <title>IMDb: Released between 2018-01-01 and 2018-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>,
 u'\n',
 <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>,
 u'\n',
 <script>\n    if (typeof uet == 'function') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>,
 u'\n',
 <script>\n    if (typeof uex == 'function') {\n      uex("ld", 

We can navigate by picking an element from the list of contents...

In [34]:
testa.contents[13]

<title>IMDb: Released between 2018-01-01 and 2018-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>

In [35]:
testa.contents[13].name

u'title'

...or using the tags:

In [36]:
testa.title

<title>IMDb: Released between 2018-01-01 and 2018-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>

[How to navigate in a soup?](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree)
<br/>An HTML page is organized hierarchically, with parents, children and descendants. Check the link for a more detailed guide.
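As a small illustration of parents, children and descendants, the sketch below parses a tiny hand-written page (an assumption made for the example, so that nothing has to be downloaded) and walks it in both directions.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A tiny hand-written page, not the IMDb one
html = "<html><head><title>Demo</title></head><body><p>Hi <b>there</b></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

bold = soup.b
parent_name = bold.parent.name                      # the tag directly containing <b>
ancestor_names = [tag.name for tag in bold.parents]  # all the ancestors, innermost first

body = soup.body
# .children yields the direct children only;
# .descendants yields the whole subtree, tags and strings alike
child_names = [c.name for c in body.children if isinstance(c, Tag)]
descendant_names = [d.name for d in body.descendants if isinstance(d, Tag)]

print(parent_name)
print(ancestor_names)
print(child_names)
print(descendant_names)
```

Strings such as `Hi ` are part of the tree too, which is why the sketch filters on `Tag` before asking for a name.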

### Finding the right tags...

BeautifulSoup is indeed beautiful, but without knowing what we are looking for, it is quite a mess...
<br/> Let us investigate the structure of the HTML. This can be done with the **Developer Tools** of your browser (the images are taken from [here](https://www.dataquest.io/blog/web-scraping-beautifulsoup/)). On Chrome:<br>
!['developer tools'](l6_developer_tools.png 'Inspect!')

There is something similar in every browser. On Firefox:
![Firefox](l6_developer_tools_firefox.png 'Sorry, I left it in Italian!')

Either way, the important thing is that it lets you examine the HTML structure.
Going back to Chrome, if you hover the pointer over a certain movie, the inspector shows the HTML element that contains it:
![Logan](l6_container.png 'Still Marvel...')

OK, so the information we are looking for is contained in a 'div' tag. There are a lot of them, but the ones like this contain all we need. For instance, the title is here:
![here](l6_h3_title.png)

... the rating is here...
![Substructure](l6_rating.png 'Here!')

... the total number of votes here...
![votes](l6_votes.png)

... the metascore is here...
![metascore](l6_metascore.png)

... the cast and the directors are here!
![cast](l6_directors_actors.png)

### Going back to our soup

We select all the 'div' tags with class 'lister-item mode-advanced' (such as the one containing the information about 'Logan').

In [46]:
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')

In [49]:
print len(movie_containers)

50


OK, it starts looking nicer: we have exactly as many elements as there are results on the search page. But is it really nice?
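A side note on how `class_` behaves when the string contains more than one class. The sketch below uses three hand-written 'div' tags (the class `mode-simple` is made up for the example) to compare an exact two-class string, a single class, and a CSS selector.

```python
from bs4 import BeautifulSoup

# Hand-written test tags; 'mode-simple' is a made-up class
html = """
<div class="lister-item mode-advanced">first</div>
<div class="mode-advanced lister-item">second</div>
<div class="lister-item mode-simple">third</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# A string with two classes must match the class attribute exactly, order included
exact = soup.find_all('div', class_='lister-item mode-advanced')
# A single class matches every tag that carries it
single = soup.find_all('div', class_='lister-item')
# A CSS selector asks for both classes, in any order
css = soup.select('div.lister-item.mode-advanced')

print(len(exact))
print(len(single))
print(len(css))
```

On the IMDb page the exact string works because every container is written in the same way; the CSS selector is the more tolerant choice if that is in doubt.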

### Refining our search

In [51]:
first_movie = movie_containers[0]
first_movie

<div class="lister-item mode-advanced">\n<div class="lister-top-right">\n<div class="ribbonize" data-caller="filmosearch" data-tconst="tt4154756"></div>\n</div>\n<div class="lister-item-image float-left">\n<a href="/title/tt4154756/?ref_=adv_li_i"> <img alt="Avengers: Infinity War" class="loadlate" data-tconst="tt4154756" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMjMxNjY2MDU1OV5BMl5BanBnXkFtZTgwNzY1MTUwNTM@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png" width="67"/>\n</a> </div>\n<div class="lister-item-content">\n<h3 class="lister-item-header">\n<span class="lister-item-index unbold text-primary">1.</span>\n<a href="/title/tt4154756/?ref_=adv_li_tt">Avengers: Infinity War</a>\n<span class="lister-item-year text-muted unbold">(2018)</span>\n</h3>\n<p class="text-muted ">\n<span class="certificate">PG-13</span>\n<span class="ghost">|</span>\n<span class="runtime">149 min</span>\n

#### The title

It could be nicer... Anyway, that is not a big issue, since we know that the title is inside the 'h3' tag...

In [52]:
first_movie.h3

<h3 class="lister-item-header">\n<span class="lister-item-index unbold text-primary">1.</span>\n<a href="/title/tt4154756/?ref_=adv_li_tt">Avengers: Infinity War</a>\n<span class="lister-item-year text-muted unbold">(2018)</span>\n</h3>

... then let us look for the inner tag with the title...

In [55]:
first_movie.h3.a

<a href="/title/tt4154756/?ref_=adv_li_tt">Avengers: Infinity War</a>

... and get the text:

In [56]:
first_movie.h3.a.text

u'Avengers: Infinity War'

**Cool!** We can even get the year (in case we are interested...)

#### The year

In [68]:
first_movie.h3.span

<span class="lister-item-index unbold text-primary">1.</span>

Where is the second one? Accessing a tag by name as an attribute returns only the first item with that name, so we search by class instead.

In [74]:
first_movie.find('span', class_="lister-item-year text-muted unbold").text

u'(2018)'
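The difference between the first match and all the matches can be seen on the 'h3' header alone. The sketch below copies that header into a string (so no download is needed) and also converts the year to an integer by slicing off the brackets.

```python
from bs4 import BeautifulSoup

# The 'h3' header of the first movie, copied as a string
html = ('<h3 class="lister-item-header">'
        '<span class="lister-item-index unbold text-primary">1.</span>'
        '<a href="/title/tt4154756/?ref_=adv_li_tt">Avengers: Infinity War</a>'
        '<span class="lister-item-year text-muted unbold">(2018)</span>'
        '</h3>')
h3 = BeautifulSoup(html, 'html.parser').h3

first_span = h3.span             # the first 'span' only
all_spans = h3.find_all('span')  # every 'span', in document order

year_text = all_spans[1].text
year = int(year_text[-5:-1])     # the four characters before the closing bracket

print(first_span.text)
print(len(all_spans))
print(year)
```

Picking the span by position is fragile if the header ever changes, which is why searching by class is the safer route.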

#### The rating

The rating is instead contained in the 'strong' tag...

In [84]:
first_movie.strong.text

u'8.5'

#### The metascore

The metascore is in yet another 'span':

In [85]:
first_movie.find('span', class_= "metascore favorable").text

u'68        '
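Two details are worth a quick sketch: the text carries trailing spaces, and the second class name describes the score. The snippet below uses two hand-written spans; the class `mixed` is an assumption made for the example, not something shown in the page above.

```python
from bs4 import BeautifulSoup

# Hand-written spans; the second class name ('mixed') is an assumption
html = ('<div><span class="metascore favorable">68        </span></div>'
        '<div><span class="metascore mixed">45        </span></div>')
soup = BeautifulSoup(html, 'html.parser')

# Searching for the single class 'metascore' matches both spans
spans = soup.find_all('span', class_='metascore')
# strip() removes the surrounding whitespace before the conversion
scores = [int(span.text.strip()) for span in spans]
print(scores)
```

Searching for `metascore` alone is therefore the more robust choice when looping over many movies.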

#### The number of voters

The number of votes is a little more involved: we have to look for the 'span' which has attribute 'name' with value 'nv': 

In [95]:
first_movie.find('span', attrs = {'name':'nv'})

<span data-value="530929" name="nv">530,929</span>

In [92]:
aux=first_movie.find('span', attrs = {'name':'nv'})

Final trick:

In [93]:
aux.text

u'530,929'

whose datatype is

In [94]:
type(aux.text)

unicode

In [96]:
aux['data-value']

u'530929'

In [97]:
int(aux.text)

ValueError: invalid literal for int() with base 10: '530,929'

In [98]:
int(aux['data-value'])

530929

Better to use the second one: 'data-value' has no thousands separator, so it converts directly to an integer.
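If only the displayed text were available, the comma could be removed by hand. A minimal sketch, using the string printed above:

```python
votes_text = u'530,929'  # what .text returns

try:
    votes = int(votes_text)
except ValueError:
    # the comma is not a digit: drop the thousands separator first
    votes = int(votes_text.replace(',', ''))

print(votes)
```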

### Exercise: Get the actors and the directors for first_movie

#### Fabio's solution

In [154]:
first_movie.find_all('p', class_="")

[<p class="text-muted ">\n<span class="certificate">PG-13</span>\n<span class="ghost">|</span>\n<span class="runtime">149 min</span>\n<span class="ghost">|</span>\n<span class="genre">\nAction, Adventure, Fantasy            </span>\n</p>,
 <p class="">\n    Directors:\n<a href="/name/nm0751577/?ref_=adv_li_dr_0">Anthony Russo</a>, \n<a href="/name/nm0751648/?ref_=adv_li_dr_1">Joe Russo</a>\n<span class="ghost">|</span> \n    Stars:\n<a href="/name/nm0000375/?ref_=adv_li_st_0">Robert Downey Jr.</a>, \n<a href="/name/nm1165110/?ref_=adv_li_st_1">Chris Hemsworth</a>, \n<a href="/name/nm0749263/?ref_=adv_li_st_2">Mark Ruffalo</a>, \n<a href="/name/nm0262635/?ref_=adv_li_st_3">Chris Evans</a>\n</p>]

In [155]:
aux_p=first_movie.find_all('p', class_="")

In [156]:
for aux in aux_p:
    # 'in' also matches a word at position 0, where str.find would return 0
    if 'Director' in aux.text:
        da_tag=aux

In [157]:
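# da_vec[0] collects the directors, da_vec[1] the stars
da_vec=[[],[]]

# helper switches from 0 to 1 once the 'Stars:' label is met
helper=0
for tagga in da_tag.contents:
    if isinstance(tagga, bs4.element.NavigableString):
        if 'Star' in tagga:
            helper+=1
    else:
        # skip the '|' separator, keep the names
        if tagga.text!='|':
            da_vec[helper].append(tagga.text)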

In [158]:
da_vec[0]

[u'Anthony Russo', u'Joe Russo']

In [159]:
da_vec[1]

[u'Robert Downey Jr.', u'Chris Hemsworth', u'Mark Ruffalo', u'Chris Evans']
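An alternative sketch relies on the links themselves: in the paragraph printed above, the directors' links contain `adv_li_dr_` and the stars' links contain `adv_li_st_`. That this convention holds for every movie is an assumption, since it has only been observed on this one example.

```python
from bs4 import BeautifulSoup

# The paragraph printed above, shortened to two directors and two stars
html = ('<p class="">\n    Directors:\n'
        '<a href="/name/nm0751577/?ref_=adv_li_dr_0">Anthony Russo</a>, \n'
        '<a href="/name/nm0751648/?ref_=adv_li_dr_1">Joe Russo</a>\n'
        '<span class="ghost">|</span> \n    Stars:\n'
        '<a href="/name/nm0000375/?ref_=adv_li_st_0">Robert Downey Jr.</a>, \n'
        '<a href="/name/nm1165110/?ref_=adv_li_st_1">Chris Hemsworth</a>\n</p>')
paragraph = BeautifulSoup(html, 'html.parser').p

links = paragraph.find_all('a')
# Assumption: '_dr_' marks a director and '_st_' marks a star
directors = [a.text for a in links if 'adv_li_dr_' in a['href']]
stars = [a.text for a in links if 'adv_li_st_' in a['href']]

print(directors)
print(stars)
```

This avoids tracking the position of the 'Stars:' label, at the price of depending on how IMDb builds its links.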

### Exercise: download all movies with a metascore among the first 50 (disregard the directors and the cast)

#### Fabio's solution

In [163]:
fifty=np.zeros(50, dtype=[('title','S30'),('year','i4'),('rating','f4'),('metascore','i4'),('votes','i4')])

In [177]:
offset=0  # number of movies skipped so far because they have no metascore
for i in xrange(50):
    movie=movie_containers[i]
    if movie.find('div', class_ = 'ratings-metascore') is not None:
        # i-offset is the next free row of the array
        fifty[i-offset]['title']=movie.h3.a.text
        print movie.h3.a.text
        # the year looks like u'(2018)': keep the four characters before the closing bracket
        aux=movie.find('span', class_="lister-item-year text-muted unbold").text
        fifty[i-offset]['year']=aux[-5:-1]
        fifty[i-offset]['rating']=movie.strong.text
        fifty[i-offset]['metascore']=movie.find('span', class_= "metascore").text
        fifty[i-offset]['votes']=movie.find('span', attrs = {'name':'nv'})['data-value']
    else:
        offset+=1

Avengers: Infinity War
Black Panther
Deadpool 2
Ready Player One
A Quiet Place
Annihilation
Jurassic World: Fallen Kingdom
Solo: A Star Wars Story
Mission: Impossible - Fallout
Ant-Man and the Wasp
Venom
Incredibles 2
Tomb Raider
Bohemian Rhapsody
Game Night
Red Sparrow
A Star Is Born
Ocean's Eight
Hereditary
Rampage
Isle of Dogs
Maze Runner: The Death Cure
Pacific Rim: Uprising
The Meg
The Cloverfield Paradox
Upgrade
Fantastic Beasts: The Crimes of Grindelwald
The Commuter
Sicario: Day of the Soldado
Love, Simon
The Nun
BlacKkKlansman
Skyscraper
Tag
The Equalizer 2
Den of Thieves
First Man
The Predator
Halloween
Crazy Rich Asians
Searching
To All the Boys I've Loved Before
Mamma Mia! Here We Go Again
Blockers
12 Strong


In [181]:
# keep only the rows that were actually filled (the unused ones still have year 0)
fifty=fifty[fifty['year']>0]
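The filtering step works because `np.zeros` leaves every unused row with year 0. A minimal sketch with a three-row array and made-up placeholder values:

```python
import numpy as np

# Same kind of layout as 'fifty', with made-up placeholder values
demo = np.zeros(3, dtype=[('title', 'S30'), ('year', 'i4'), ('rating', 'f4')])
demo[0] = ('Movie A', 2018, 7.5)
demo[1] = ('Movie B', 2018, 6.0)
# demo[2] is never filled: its year is still 0

mask = demo['year'] > 0  # one boolean per row
filled = demo[mask]      # keep only the rows where the mask is True

print(mask)
print(len(filled))
```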

### Exercise: download the first 400 movies with a metascore (disregard the directors and the cast)
Hints: 
- [ASCII vs UNICODE](https://stackoverflow.com/questions/2365411/convert-unicode-to-ascii-without-errors-in-python)
- [time module and sleep](https://docs.python.org/2/library/time.html#time.sleep) (do not send too many requests in a short time, or you may get banned)

#### Fabio's solution

In [213]:
import time
import datetime as dt

##### asciify

In [197]:
def asciify(s):
    # Replace every non-ASCII character with a '?'
    return "".join([x if ord(x) < 128 else '?' for x in s])

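ASCII has just 128 characters; Unicode has many more. [ord](https://docs.python.org/2/library/functions.html#ord) returns the integer code of a single character. Note that in Python 2 a plain string is a sequence of bytes, so a non-ASCII character such as 'ä' shows up as more than one number: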

In [203]:
[ord(i) for i in 'caccä']

[99, 97, 99, 99, 195, 164]
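The six numbers above are bytes, not characters. The sketch below compares the code points of the same word with its UTF-8 bytes, and shows the built-in `encode('ascii', 'replace')`, which is an alternative to `asciify` and not the function used in this lecture.

```python
text = u'cacc\xe4'  # 'cacc' followed by a-umlaut (U+00E4), written with an escape

code_points = [ord(ch) for ch in text]              # one number per character
utf8_bytes = list(bytearray(text.encode('utf-8')))  # one number per byte
ascii_only = text.encode('ascii', 'replace').decode('ascii')

print(code_points)
print(utf8_bytes)
print(ascii_only)
```

The last character has code point 228, above the ASCII limit of 127, and takes two bytes in UTF-8.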

##### Loading

In [207]:
films=np.zeros(400, dtype=[('title','S50'),('year','i4'),('rating','f4'),('metascore','i4'),('votes','i4')])

In [217]:
position=0  # number of movies stored so far
page=0
while position<400:
    # each result page lists 50 movies, so page 0 starts at 1, page 1 at 51, ...
    url = 'https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&start='+str(1+page*50)+'&ref_=adv_nxt'
    print '{0:%H:%M:%S } page={1:d}'.format(dt.datetime.now(),page)
    response = get(url,headers = {"Accept-Language": "en-US, en;q=0.5"})
    html_soup = BeautifulSoup(response.text, 'html.parser')
    movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
    # loop over the containers actually found, since a page may hold fewer than 50
    for movie in movie_containers:
        if position>=400:
            break
        # keep only the movies with a metascore
        if movie.find('div', class_ = 'ratings-metascore') is not None:
            films[position]['title']=asciify(movie.h3.a.text)
            print '{0:%H:%M:%S} {1:d}) {2:s}'.format(dt.datetime.now(),position, films[position]['title'])
            aux=movie.find('span', class_="lister-item-year text-muted unbold").text
            films[position]['year']=aux[-5:-1]
            films[position]['rating']=movie.strong.text
            films[position]['metascore']=movie.find('span', class_= "metascore").text
            films[position]['votes']=movie.find('span', attrs = {'name':'nv'})['data-value']
            position+=1
    page+=1
    # be polite: rest for a minute every 10 pages
    if page % 10==0:
        print '{0:%H:%M:%S} Taking a nap...'.format(dt.datetime.now())
        time.sleep(60)

12:50:58  page=0
12:51:01 0) Avengers: Infinity War
12:51:01 1) Black Panther
12:51:01 2) Deadpool 2
12:51:01 3) Ready Player One
12:51:01 4) A Quiet Place
12:51:01 5) Annihilation
12:51:01 6) Jurassic World: Fallen Kingdom
12:51:01 7) Solo: A Star Wars Story
12:51:01 8) Mission: Impossible - Fallout
12:51:01 9) Ant-Man and the Wasp
12:51:01 10) Venom
12:51:01 11) Incredibles 2
12:51:01 12) Tomb Raider
12:51:01 13) Bohemian Rhapsody
12:51:01 14) Game Night
12:51:01 15) Red Sparrow
12:51:01 16) A Star Is Born
12:51:01 17) Ocean's Eight
12:51:01 18) Hereditary
12:51:01 19) Rampage
12:51:01 20) Isle of Dogs
12:51:01 21) Maze Runner: The Death Cure
12:51:01 22) Pacific Rim: Uprising
12:51:01 23) The Meg
12:51:01 24) The Cloverfield Paradox
12:51:01 25) Fantastic Beasts: The Crimes o
12:51:01 26) Upgrade
12:51:01 27) The Commuter
12:51:01 28) Sicario: Day of the Soldado
12:51:01 29) Love, Simon
12:51:01 30) The Nun
12:51:01 31) BlacKkKlansman
12:51:01 32) Skyscraper
12:51:01 33) Tag
12:51:0

12:52:41 272) Zanna Bianca
12:52:41 273) Elizabeth Harvest
12:52:41 274) Birds of Passage
12:52:41 275) Boarding School
12:52:41 276) First Match
12:52:41 277) The Bleeding Edge
12:52:41 Taking a nap...
12:53:41  page=20
12:53:43 278) Nobody's Fool
12:53:43 279) Come Sunday
12:53:43 280) Jonathan
12:53:44  page=21
12:53:46 281) A Kid Like Jake
12:53:46 282) The Rachel Divide
12:53:46  page=22
12:53:48 283) The Festival
12:53:48 284) Boundaries
12:53:48 285) What Keeps You Alive
12:53:48 286) Do You Trust this Computer?
12:53:48 287) The Possession of Hannah Grace
12:53:48 288) Coldplay: A Head Full of Dream
12:53:48  page=23
12:53:50 289) A Private War
12:53:50 290) We the Animals
12:53:50 291) The Front Runner
12:53:50  page=24
12:53:52 292) Ash Is Purest White
12:53:52 293) Skate Kitchen
12:53:52 294) Dog Days
12:53:52 295) Madeline's Madeline
12:53:53 296) An Evening with Beverly Luff L
12:53:53  page=25
12:53:55 297) The Seagull
12:53:55 298) Detective Chinatown 2
12:53:55 299) Jos
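The two ingredients of the loop, building the URL of a page and pausing now and then, can be isolated in two small helpers. This is only a sketch: `page_url` and `polite_pause` are hypothetical names introduced here, and the `sleep` argument exists just to make the pause easy to try out without waiting.

```python
import time

BASE_URL = ('https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31'
            '&sort=num_votes,desc&start={start}&ref_=adv_nxt')


def page_url(page, per_page=50):
    """URL of a result page: page 0 starts at result 1, page 1 at 51, ..."""
    return BASE_URL.format(start=1 + page * per_page)


def polite_pause(page, every=10, seconds=60, sleep=time.sleep):
    """Sleep once every `every` pages; return True if a pause was taken."""
    if page > 0 and page % every == 0:
        sleep(seconds)
        return True
    return False


print(page_url(0))
print(page_url(2))
```

Keeping the pause in its own function makes it simple to change the rhythm, for example to a short sleep after every page.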

##### Some checks

In [223]:
films[films['metascore']==np.max(films['metascore'])]

array([('Un affare di famiglia', 2018, 8.1, 95, 5095),
       ('Roma', 2018, 8.6, 95, 2509)],
      dtype=[('title', 'S30'), ('year', '<i4'), ('rating', '<f4'), ('metascore', '<i4'), ('votes', '<i4')])

In [224]:
films[films['metascore']==np.min(films['metascore'])]

array([('Death of a Nation', 2018, 4.9, 1, 5029)],
      dtype=[('title', 'S30'), ('year', '<i4'), ('rating', '<f4'), ('metascore', '<i4'), ('votes', '<i4')])

In [226]:
films[films['rating']==np.max(films['rating'])]

array([('They Shall Not Grow Old', 2018, 8.7, 88, 3464),
       ('Free Solo', 2018, 8.7, 83, 1935)],
      dtype=[('title', 'S30'), ('year', '<i4'), ('rating', '<f4'), ('metascore', '<i4'), ('votes', '<i4')])

In [225]:
films[films['rating']==np.min(films['rating'])]

array([('Future World', 2018, 3.1, 10, 3417),
       ('Delirium', 2018, 3.1, 27,  592)],
      dtype=[('title', 'S30'), ('year', '<i4'), ('rating', '<f4'), ('metascore', '<i4'), ('votes', '<i4')])

### Exercise: build on the fly the edgelist of the bipartite network actors/movies for the first 400 films

#### Fabio's solution

Compared with the previous case, the problem is that we do not know the number of edges in advance, so we collect them in two lists and build the array only at the end.

In [293]:
position=0  # number of movies processed so far
page=0
movie_list=[]  # movie_list[k] and actor_list[k] form the k-th edge
actor_list=[]
while position<400:
    url = 'https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&start='+str(1+page*50)+'&ref_=adv_nxt'
    print '{0:%H:%M:%S } page={1:d}'.format(dt.datetime.now(),page)
    response = get(url,headers = {"Accept-Language": "en-US, en;q=0.5"})
    html_soup = BeautifulSoup(response.text, 'html.parser')
    movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
    # loop over the containers actually found, since a page may hold fewer than 50
    for movie in movie_containers:
        if position>=400:
            break
        if movie.find('div', class_ = 'ratings-metascore') is not None:
            title=asciify(movie.h3.a.text)
            print '{0:%H:%M:%S} {1:d}) {2:50s}'.format(dt.datetime.now(),position, title)

            # look for the paragraph listing directors and stars;
            # start from an empty list so that a movie without it adds no edges
            da_contents=[]
            for aux in movie.find_all('p', class_=""):
                if 'Director' in aux.text:
                    da_contents=aux.contents
            # helper becomes 1 once the 'Stars:' label is met
            helper=0
            for tagga in da_contents:
                if isinstance(tagga, bs4.element.NavigableString):
                    if 'Star' in tagga:
                        helper+=1
                else:
                    # one edge for each star, skipping the '|' separator
                    if tagga.text!='|' and helper==1:
                        movie_list.append(title)
                        actor_list.append(asciify(tagga.text))
            position+=1
    page+=1
    # be polite: rest for a minute every 10 pages
    if page % 10==0:
        print '{0:%H:%M:%S} Taking a nap...'.format(dt.datetime.now())
        time.sleep(60)
print '{0:%H:%M:%S} Download finished, checking lists={1:b}'.format(dt.datetime.now(),len(movie_list)==len(actor_list))

# now the number of edges is known: build the edge list
outcome=np.zeros(len(actor_list), dtype=[('film','S40'),('actor','S40')])
outcome['film']=movie_list
outcome['actor']=actor_list
        

15:26:31  page=0
15:26:32 0) Avengers: Infinity War                            
15:26:32 1) Black Panther                                     
15:26:33 2) Deadpool 2                                        
15:26:33 3) Ready Player One                                  
15:26:33 4) A Quiet Place                                     
15:26:33 5) Annihilation                                      
15:26:33 6) Jurassic World: Fallen Kingdom                    
15:26:33 7) Solo: A Star Wars Story                           
15:26:33 8) Mission: Impossible - Fallout                     
15:26:33 9) Ant-Man and the Wasp                              
15:26:33 10) Venom                                             
15:26:33 11) Incredibles 2                                     
15:26:33 12) Tomb Raider                                       
15:26:33 13) Bohemian Rhapsody                                 
15:26:33 14) Game Night                                        
15:26:33 15) Red Sparrow         

15:26:39 127) Destination Wedding                               
15:26:39 128) Teen Titans Go! To the Movies                     
15:26:39 129) The Night Comes for Us                            
15:26:39  page=4
15:26:41 130) Il Grinch                                         
15:26:41 131) Cold War                                          
15:26:41 132) Unfriended: Dark Web                              
15:26:41 133) Don't Worry, He Won't Get Far on Foot             
15:26:41 134) Occupation                                        
15:26:41 135) The Nutcracker and the Four Realms                
15:26:41 136) 7 Days in Entebbe                                 
15:26:41 137) The Girl in the Spider's Web                      
15:26:41 138) Breaking In                                       
15:26:41 139) The Tale                                          
15:26:41 140) In Darkness                                       
15:26:41 141) Hunter Killer                                     
15:26:41

15:28:03 252) Free Solo                                         
15:28:03 253) Summer                                            
15:28:03  page=15
15:28:05 254) Puzzle                                            
15:28:05 255) Museo                                             
15:28:05  page=16
15:28:08 256) Birthmarked                                       
15:28:08 257) Life Itself                                       
15:28:08 258) Puppet Master: The Littlest Reich                 
15:28:08 259) They'll Love Me When I'm Dead                     
15:28:08 260) Support the Girls                                 
15:28:08 261) Quincy                                            
15:28:08  page=17
15:28:10 262) Slice                                             
15:28:10 263) La donna dello scrittore                          
15:28:10 264) Heavy Trip                                        
15:28:10  page=18
15:28:12 265) The Dark                                          
15:28:12 266) Duck

15:32:16 370) Reversing Roe                                     
15:32:16 371) Welcome Home                                      
15:32:16 372) Double Lives                                      
15:32:16 373) Prospect                                          
15:32:16 374) The Escape of Prisoner 614                        
15:32:16 375) The Image Book                                    
15:32:16 376) Ruben Brandt, Collector                           
15:32:16 377) In a Relationship                                 
15:32:16 378) The Ranger                                        
15:32:16  page=44
15:32:18 379) Pity                                              
15:32:18 380) Loveling                                          
15:32:18 381) In guerra                                         
15:32:18 382) Eva                                               
15:32:18  page=45
15:32:21 383) The Long Dumb Road                                
15:32:21 384) Supercon                                

#### Check

Each film should list at most 4 stars:

In [294]:
film_edli,film_edli_k= np.unique(outcome['film'], return_counts=True)

In [295]:
np.unique(film_edli_k)

array([1, 2, 3, 4])

In [296]:
actor_edli, actor_edli_k=np.unique(outcome['actor'],return_counts=True)

In [297]:
actor_edli[actor_edli_k==np.max(actor_edli_k)]

array(['Paul Rudd', 'Rashida Jones'], dtype='|S40')
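The same degree counts can be obtained without numpy, using `collections.Counter` on the edge list. The sketch below uses a made-up edge list with placeholder names, not the downloaded data.

```python
from collections import Counter

# A made-up edge list of (film, actor) pairs, with placeholder names
edges = [('Film A', 'Actor 1'), ('Film A', 'Actor 2'),
         ('Film B', 'Actor 1'), ('Film C', 'Actor 3')]

# degree of a node = number of edges it appears in
actor_degree = Counter(actor for _, actor in edges)
film_degree = Counter(film for film, _ in edges)

top = max(actor_degree.values())
busiest = sorted(a for a, k in actor_degree.items() if k == top)

print(busiest)
print(top)
```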