## BBC project: process, hints, and recipes

The major challenge of the BBC project is to transform the list of critics and movies into searchable Python lists and/or dictionaries. The most difficult part comes first: scraping the page on the BBC site and, using Beautiful Soup and regular expressions, building a data set that will work.

Once you have the data set, you will be in good shape going forward. The goal after that will be to search for interesting patterns (top movies by country/critic/director/year). This is the conceptual work you need to be thinking about while you struggle through wrangling your data.

So, how do I wrangle this data? That is the central challenge that you'll be dealing with through Wednesday of this week. The HTML page on the BBC site poses a number of challenges. The layout is relatively simple and consistent, but that simplicity actually makes it a little bit harder, because there are not many HTML tags to help you isolate each unit of data. You can use Beautiful Soup to isolate the line that contains all the information for a critic, and you can isolate each group of top 10 movies as well. The harder part is using Beautiful Soup to find each critic together with the list of movies that immediately follows her/him. (Doing that is challenging. I have instructions on how to figure it out, but if you can't, just email me and I will send you the code.)

Yes, that is how this process will work--below I have step-by-step instructions so you can try to write the code yourself. Do your best--and if you can't get there, email me and I will send you working code so you can move on to the next step.


In [6]:
##
##grouping by critic's country:
##get table set by Thursday

### Getting started: Data Architecture
You can come up with your own data scheme for this, but the one I'm recommending is three separate lists:

The central challenge of this project is figuring out how you are going to set up your table or tables from this long list of critics and movies. What will each row be? What will the columns be in each row? How can you set it up so that you have the most useful table possible?

Some things to think about: the main categories of analysis that are possible include movie, director, critic, critic's country, year, and whatever else you bring to this. Try to design a schema that will give you a table that you can run solid queries on. 
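One possible shape, sketched here only as a starting point: treat every row as a single vote (one critic choosing one movie). The column names below are suggestions, not a required schema, and the three sample rows are copied from the first two lists on the page.

```python
from collections import Counter

# Hypothetical schema: one row per vote (one critic picking one movie).
votes = [
    {"critic": "Simon Abrams", "critic_country": "US",
     "movie": "Mulholland Drive", "director": "David Lynch", "year": 2001},
    {"critic": "Simon Abrams", "critic_country": "US",
     "movie": "In the Mood for Love", "director": "Wong Kar-wai", "year": 2000},
    {"critic": "Sam Adams", "critic_country": "US",
     "movie": "In the Mood for Love", "director": "Wong Kar-wai", "year": 2000},
]

# With one vote per row, most questions become a count over a single column.
votes_per_movie = Counter(row["movie"] for row in votes)
votes_per_director = Counter(row["director"] for row in votes)

print(votes_per_movie.most_common(1))
```

The critic's details repeat on every row, which looks wasteful but is what makes grouping by movie, director, critic, country, or year a one-line query later.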

For this project, if you're interested in recalling your knowledge of SQL, you can do the additional step of entering your transformed data into postgres. Or you can just stick with pandas.

### Interpretive Architecture
**REMEMBER: secondary source** One of the steps this week is to find a source you can use to get the country of origin for each director. This is something you need to search for on your own. It will be hard to find a single page that lists every director, but see what you can find. In the end, you don't need a complete database of every single director, but do your best to get as many as you can.

You don't necessarily have to go in the direction of directors' origin. You can certainly try to think of other categories of interpretation that you can join to this initial dataset. This is how you bring your point of view to a relatively large data set that seeks to frame the past 15 years of cinema. How can you bring a different point of view to this subject? You can certainly narrow your focus to a specific country, a group of countries, or a region. Either way, think about other data that might bring different types of insight to this list.
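As a rough illustration of what "joining" a secondary source means, here is a minimal sketch using plain dictionaries. The lookup table `director_country` is hypothetical and deliberately incomplete; your own source will have its own shape and many more entries.

```python
# Hypothetical secondary source: director -> country of birth.
# Only two entries, to show what happens when a director is missing.
director_country = {
    "David Lynch": "USA",
    "Wong Kar-wai": "China",
}

rows = [
    {"movie": "Mulholland Drive", "director": "David Lynch"},
    {"movie": "In the Mood for Love", "director": "Wong Kar-wai"},
    {"movie": "The Tree of Life", "director": "Terrence Malick"},
]

# Attach the new column; directors missing from the lookup get None.
for row in rows:
    row["director_country"] = director_country.get(row["director"])

# Listing the gaps tells you which directors you still need to look up.
missing = [row["director"] for row in rows if row["director_country"] is None]
print(missing)
```

Keeping the unmatched rows (instead of dropping them) lets you see how complete your secondary source is before you draw conclusions from it.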

### Ready to code?

The first thing you need to do is import Beautiful Soup and requests like we did in the homework, and scrape the page: http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted


One thing I should note: there are two inconsistencies (actual errors in the HTML) that will cause you to lose a couple of entries (which is okay but may be frustrating). I have posted a version of the exact same page with those inconsistencies fixed, if you want to scrape from that page instead:

http://floatingmedia.com/columbia/BBC.html

It's up to you. Okay let's begin!

**STEP 1**


In [11]:
##Import your libraries: requests, Beautiful Soup, re (for regular expressions), and pandas
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd



In [12]:
# read the URL, and put the HTML page into beautiful soup
response = requests.get('http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted')
doc = BeautifulSoup(response.text, 'html.parser')


In [13]:
#Using beautiful soup find the div tag that contains 
#the entire list of critics and movies
#Make a variable (like all_info) that holds all that information 
all_info = doc.find_all(class_= 'body-content')
all_info


[<div class="body-content">
 <p>Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE, China, Bangladesh, Chile, Namibia, Kazakhstan and many others are represented too. Of the 177 critics, 55 are women and 122 are men. We present their votes here in alphabetical order.</p><p><strong>Simon Abrams – Freelance film critic (US)</strong></p><p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad 

**STEP 2** Here is where it begins to get tricky: at this point everything we want is wrapped in `<p>` tags. Use a Beautiful Soup find_all to get a list of everything in a `<p>` tag. Make a variable that contains that list (you could call it all_p or something).


In [14]:
#find_all
all_p = doc.find_all('p')
all_p[-2]
#.find_next_sibling()

<p class="date-item-content">Sep</p>

**STEP 3** This is where all the magic has to happen: you need to find a way to loop through all of the `<p>` elements (loop through the list you just got from the find_all()) and pull out the critics and their lists of movies.

Critics should not be too hard--every critic entry is embedded in `<strong>` tags. But in order to get the movies attached to that critic--you need to find the `<p>` tag immediately following each `<p><strong>` -- you can do this using next_sibling.

So, you need to build a loop that goes through your `all_p` list:

if it has a `<strong>` tag then 
critic_info = p_line.strong.string
movie_info = p_line.next_sibling

As you go through this loop print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by movie line's HTML--you've got it!

I give you the beginning of the loop below, and then you can build it piece by piece. If you want to see the overall architecture of the final loop, I have a commented example at the end of the page, though it might not be helpful to look at it at this point. See how you do step-by-step, and if you get stuck at a step, email me with your code!



In [15]:
##Write your loop for STEP 3 here
#I started this for you,
#Because you only want it to search starting with each critic
#   if lines.strong is not None: does that for you
for lines in all_p[0:-1]:
    if lines.strong is not None:
        critic_info = lines.find('strong').string
        movie_info = lines.find_next_sibling().text
        print(critic_info)
        print(movie_info)
        print("-------------")




Simon Abrams – Freelance film critic (US)
1. Mulholland Drive (David Lynch, 2001)2. In the Mood for Love (Wong Kar-wai, 2000)3. The Tree of Life (Terrence Malick, 2011)4. Yi Yi: A One and a Two (Edward Yang, 2000)5. Goodbye to Language (Jean-Luc Godard, 2014)6. The White Meadows (Mohammad Rasoulof, 2009)7. Night Across the Street (Raoul Ruiz, 2012)8. Certified Copy (Abbas Kiarostami, 2010)9. Sparrow (Johnnie To, 2008)10. Fados (Carlos Saura, 2007)
-------------
Sam Adams – Freelance film critic (US)
1. In the Mood for Love (Wong Kar-wai, 2000)2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)4. Spirited Away (Hayao Miyazaki, 2001)5. The Act of Killing (Joshua Oppenheimer, 2012)6. The Grand Budapest Hotel (Wes Anderson, 2014)7. The New World (Terrence Malick, 2004)8. Certified Copy (Abbas Kiarostami, 2010)9. The World (Jia Zhangke, 2004)10. Elephant (Gus Van Sant, 2003)
-------------
Thelma Adams – Freelance film cr

**STEP 4**
If your loop is successfully isolating those two lines, now it's time to parse each line with regular expressions. This needs to happen inside the loop: for every critic, and then (in STEP 5) for every movie. Here just **focus on getting the critic's name, organization, and country.**

Inside the loop--once you have critic_info -- make a regular expression that pulls out the name of the critic--make a variable called critic_name

`critic_name = re.findall(regex, critic_info)`

Do the same thing for critic_org and critic_cn

As you go, print(critic_name) then print(critic_org), etc., to make sure you're getting the results. It might help, before you do all these regular expressions in a loop, to just grab one critic's line and test regular expressions on it, to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.
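If juggling three separate patterns gets confusing, one alternative is a single pattern with named groups. This is only a sketch, not the expected answer: `CRITIC_RE` and `parse_critic` are made-up names, and the pattern assumes every critic line looks like "name, en dash, organization, country in parentheses", as the two sample lines do.

```python
import re

# Assumed line shape: "<name> – <organization> (<country>)", with an en dash
# surrounded by spaces between the name and the organization.
CRITIC_RE = re.compile(
    r"^(?P<name>.+?)\s+–\s+(?P<org>.+)\s+\((?P<country>[^()]+)\)\s*$"
)

def parse_critic(line):
    """Return a dict with name, org and country, or None if the line does not fit."""
    match = CRITIC_RE.match(line)
    return match.groupdict() if match else None

print(parse_critic("Wen Tien-Hsiang – Taipei Golden Horse Film Festival (Taiwan)"))
print(parse_critic("Jean-Philippe Guerand – L'Avant-Scène Cinéma (France)"))
```

Returning `None` for lines that do not fit gives you something explicit to test for, instead of an `IndexError` from indexing an empty `findall` result.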

In [16]:
#Practice/Build your regular expressions here
import re
crit_sample1 = "Wen Tien-Hsiang – Taipei Golden Horse Film Festival (Taiwan)"
crit_sample2 = "Jean-Philippe Guerand – L'Avant-Scène Cinéma (France)"
regex_for_name = r"^\w+[-]?\s?\w+[-]?\s?\w+"
regex_for_org = r"\w\s[^-]\s(\w.*) [()]"
regex_for_cn = r"[(](\w+\s?\w+)[)]"
name = re.findall(regex_for_name,crit_sample1)
name[0]

'Wen Tien-Hsiang'

That is the trick! The way that you're scraping it, you get each critic's information and then, nested inside that critic's list, another list with each of their movies. So to get a movie you need to navigate down like this: mylist[0][0] would get you 'Simon Abrams', mylist[0][4][0] would get you 'Mulholland Drive', and mylist[0][4][12] would get you 'Goodbye to Language'. This is not the best format to put into data frames. So one of the challenges is figuring out what the best kind of row would look like. You will want to modify the way your loop builds these lists so that you get a data frame that works well.

In [13]:
#Take your working loop from step three
#And put it here With the regular expression parsing inside it

In [17]:
import re
for lines in all_p[3:-1]:
    if lines.strong is not None:
        critic_info = lines.find('strong').text
        regex_for_name = r"^\w+[-]?\s?\w+[-]?\s?\w+"
        regex_for_org =  r"\w\s[^-]\s(\w.*) [()]"
        regex_for_cn = r"[(](\w+\s?\w+)[)]"
        try:
            critics_name = re.findall(regex_for_name, critic_info)[0]
            critics_org = re.findall(regex_for_org, critic_info)[0]
            critics_cn = re.findall(regex_for_cn, critic_info)[0]
            movie_info = lines.find_next_sibling().text
            print(critics_name)
            print(critics_org)
            print(critics_cn)
        except:
            pass
#         print(critic_info)

        #print(movie_info)  
#     print("-------------")

Simon Abrams
Freelance film critic
US
Sam Adams
Freelance film critic
US
Thelma Adams
Freelance film critic
US
Arturo Aguilar
Rolling Stone Mexico
Mexico
Matthew Anderson
BBC Culture
UK
Tim Appelo
The Wrap
US
Adriano Aprà
Film historian
Italy
Michael Arbeiter
Nerdist
US
Ali Arikan
Dipnot TV
Turkey
Michael Atkinson
The Village Voice
US
Ana Maria Bahiana
Freelance film critic
Brazil
Cameron Bailey
Toronto Film Festival
Canada
Lindsay Baker
BBC Culture
UK
Miriam Bale
Freelance film critic
US
Nicholas Barber
BBC Culture
UK
Diego Batlle
La Nacion
Argentina
NT Binh
Positif
France
Lizelle Bisschoff
University of Glasgow
UK
Christian Blauvelt
BBC Culture
US
Mahen Bonetti
African Film Festival Inc
US
Andreas Borcholte
Spiegel Online
Germany
Utpal Borpujari
Freelance film critic
India
Richard Brody
The New Yorker
US
Hannah Brown
Jerusalem Post
Israel
Luke Buckmaster
The Guardian/BBC Culture
Australia
Luciano Castillo
Cinemateca de Cuba
Cuba
Monica Castillo
New York Times Watching
US
Samuel Castr

**STEP 5**
Now you need to get your **movie names**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable, which holds each movie followed by a `<br>` tag. I showed you this in class, but here it is again. To get a list of every string that is not a `<br>` tag, use this method:

`each_movie = movie_info.find_all(string=True)`

This will give you a list called `each_movie`, which will contain a string for each movie, like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop, that goes to each movie and prints out each movie.
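If you ever end up holding only the flattened `.text` of a movie paragraph (where the line breaks are lost and the entries run together, as in the STEP 3 printout), one fallback is to put the breaks back with a regular expression. This is a sketch using only the standard library; `split_movies` is a made-up helper name, and it assumes every entry ends with `)` and the next one starts with a rank such as `2. `.

```python
import re

def split_movies(flat_text):
    """Split run-together movie entries into one string per movie."""
    # Insert a newline after every ")" that is directly followed by a rank like "2. "
    with_breaks = re.sub(r"\)(?=\d+\.\s)", ")\n", flat_text)
    return with_breaks.splitlines()

flat = ("1. Mulholland Drive (David Lynch, 2001)"
        "2. In the Mood for Love (Wong Kar-wai, 2000)"
        "3. The Tree of Life (Terrence Malick, 2011)")

for movie in split_movies(flat):
    print(movie)
```

Working from the tag itself with `find_all(string=True)` is still the cleaner route, because it does not depend on how the entries happen to be punctuated.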


In [15]:
##Take your working loop and add the find_all for each_movie
#And the inner loop that loops through each_movie

In [18]:
import re
for lines in all_p[3:-5]:
    if lines.strong is not None:
        critic_info = lines.find('strong').string
        regex_for_name = r"^\w+[-]?\s?\w+[-]?\s?\w+"
        regex_for_org =  r"\w\s[^-]\s(\w.*) [()]"
        regex_for_cn = r"[(](\w+\s?\w+)[)]"
        critics_name = re.findall(regex_for_name, critic_info)
        critics_org = re.findall(regex_for_org, critic_info)
        critics_cn = re.findall(regex_for_cn, critic_info)
        movie_info = lines.find_next_sibling()      
        each_movie = movie_info.find_all(string=True)
        for movie_list in each_movie:
            movie_name = movie_list
            print(movie_name)                
        print(critic_info)
        print(critics_name[0])
        print(critics_org[0])
        print(critics_cn[0])

#         print(movie_info)  
        print("-------------")

1. Mulholland Drive (David Lynch, 2001)
2. In the Mood for Love (Wong Kar-wai, 2000)
3. The Tree of Life (Terrence Malick, 2011)
4. Yi Yi: A One and a Two (Edward Yang, 2000)
5. Goodbye to Language (Jean-Luc Godard, 2014)
6. The White Meadows (Mohammad Rasoulof, 2009)
7. Night Across the Street (Raoul Ruiz, 2012)
8. Certified Copy (Abbas Kiarostami, 2010)
9. Sparrow (Johnnie To, 2008)
10. Fados (Carlos Saura, 2007)
Simon Abrams – Freelance film critic (US)
Simon Abrams
Freelance film critic
US
-------------
1. In the Mood for Love (Wong Kar-wai, 2000)
2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)
3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)
4. Spirited Away (Hayao Miyazaki, 2001)
5. The Act of Killing (Joshua Oppenheimer, 2012)
6. The Grand Budapest Hotel (Wes Anderson, 2014)
7. The New World (Terrence Malick, 2004)
8. Certified Copy (Abbas Kiarostami, 2010)
9. The World (Jia Zhangke, 2004)
10. Elephant (Gus Van Sant, 2003)
Sam Adams – Freelance film cr

8. Brokeback Mountain (Ang Lee, 2005)
9. A Separation (Asghar Farhadi, 2011)
10. Goodbye to Language (Jean-Luc Godard, 2014)
Uri Klein – Haaretz (Israel)
Uri Klein
Haaretz
Israel
-------------
1. Boyhood (Richard Linklater, 2014)
2. Holy Motors (Leos Carax, 2012)
3. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)
4. Uncle Boonmee Who Can Recall His Past Lives (Apichatpong Weerasethakul, 2010)
5. Enter the Void (Gaspar Noé, 2009)
6. Son of Saul (László Nemes, 2015)
7. Leviathan (Andrey Zvyagintsev, 2014)
8. The Act of Killing (Joshua Oppenheimer, 2012)
9. 12 Years a Slave (Steve McQueen, 2013)
10. Idiocracy (Mike Judge, 2006)
Eric Kohn – IndieWire (US)
Eric Kohn
IndieWire
US
-------------
1. Spirited Away (Hayao Miyazaki, 2001)
2. Yi Yi: A One and a Two (Edward Yang, 2000)
3. Before Sunset (Richard Linklater, 2004)
4. You Can Count On Me (Kenneth Lonergan, 2000)
5. Inside Out (Pete Docter, 2015)
6. Morvern Callar (Lynne Ramsay, 2012)
7. Stories We Tell (Sarah Polley, 2012)
8

Qatar
-------------
1. Spring, Summer, Fall, Winter…and Spring (Kim Ki-duk, 2003)
2. The Hours (Stephen Daldry, 2002)
3. The Sun Also Rises (Jiang Wen, 2007)
4. A Separation (Asghar Farhadi, 2011)
5. Lust, Caution (Ang Lee, 2007)
6. The Lives of Others (Florian Henckel von Donnersmarck, 2006)
7. Still Life (Jia Zhangke, 2006)
8. Birdman (Alejandro González Iñárritu, 2014)
9. Infernal Affairs (Andrew Lau and Alan Mak, 2002)
10. City of God (Fernando Meirelles and Kátia Lund, 2002)
Raymond Zhou – China Daily (China)
Raymond Zhou
China Daily
China
-------------
If you would like to comment on this story or anything else you have seen on BBC Culture, head over to our 
Facebook
 page or message us on 
Twitter
.
More on BBC Culture’s 100 greatest films of the 21st Century:
More on BBC


IndexError: list index out of range

Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.
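One way to think about the movie line is as four fields in a fixed order: rank, title, director, year. The sketch below captures all four with one pattern and named groups. `MOVIE_RE` and `parse_movie` are hypothetical names, and the pattern assumes the director and a four-digit year always sit inside the last pair of parentheses.

```python
import re

# Assumed line shape: "<rank>. <title> (<director>, <year>)"
MOVIE_RE = re.compile(
    r"^(?P<rank>\d+)\.\s+(?P<title>.+)\s+\((?P<director>[^()]+),\s*(?P<year>\d{4})\)\s*$"
)

def parse_movie(line):
    """Return the four fields as a dict, or None if the line is not a movie."""
    match = MOVIE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["rank"] = int(fields["rank"])
    fields["year"] = int(fields["year"])
    return fields

print(parse_movie("1. Zero Dark Thirty (Kathryn Bigelow, 2012)"))
print(parse_movie("7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"))
print(parse_movie("More on BBC Culture"))
```

Lines that are not movies (such as the footer text at the bottom of the page) come back as `None`, so the loop can skip them instead of stopping with `IndexError: list index out of range`.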


In [17]:
#Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
regex_for_mname = r"\d. (\w.*)\s[(]"
regex_for_rank = r"^\d+"
movie_name = re.findall(regex_for_mname,movie_sample)
# movie_name[0]
movie_rank = re.findall(regex_for_rank,movie_sample)
movie_rank[0]
# regex_for_dir = r"[(](\w.*),"
# regex_for_year = r", (\d+)[)]" 
# director_name = re.findall(regex_for_dir,movie_harder)
# year = re.findall(regex_for_year,movie_harder)
# director_name[0]
# year[0]


'1'

**STEP 6**
You're almost there!!! Now that you have a working regular expression, put that in your inner loop to get the movie name.

So now the entire loop should be getting you 13 elements:
- critic_name
- critic_org
- critic_cn

And an inner loop that will run 10 times (for the 10 movies) and give you 10 instances of:
- rank (this is actually optional)
- movie_name
- director
- year

Build this loop using print() on the first one or two critic selections, just to make sure you are pulling out the right data.




In [19]:
#Get that loop working here
import re
for lines in all_p[3:]:
    if lines.strong is not None:
        critic_info = lines.find('strong').string
        regex_for_name = r"^\w+[-]?\s?\w+[-]?\s?\w+"
        regex_for_org =  r"\w\s[^-]\s(\w.*) [()]"
        regex_for_cn = r"[(](\w+\s?\w+)[)]"
        critics_name = re.findall(regex_for_name, critic_info)
        critics_org = re.findall(regex_for_org, critic_info)
        critics_cn = re.findall(regex_for_cn, critic_info)
        movie_info = lines.find_next_sibling()      
        each_movie = movie_info.find_all(string=True)
        for movie_list in each_movie:
            movie = movie_list
            regex_for_rank =r"^\d+"
            regex_for_mname = r"\d. (\w.*)\s[(]"
            regex_for_dir = r"[(](\w.*),"
            regex_for_year = r", (\d+)[)]" 
            movie_rank = re.findall(regex_for_rank,movie)
            movie_name = re.findall(regex_for_mname,movie)
            movie_dir = re.findall(regex_for_dir,movie)
            movie_year = re.findall(regex_for_year,movie)


            #print(movie_rank)
            print(movie_name)
            #print(movie_dir)
            #print(movie_year)
#        print(critic_info)
#        print(critics_name)
#        print(critics_org)
#        print(critics_cn)
#       print(movie_info)  
        print("-------------")


['Mulholland Drive']
['In the Mood for Love']
['The Tree of Life']
['Yi Yi: A One and a Two']
['Goodbye to Language']
['The White Meadows']
['Night Across the Street']
['Certified Copy']
['Sparrow']
['Fados']
-------------
['In the Mood for Love']
['Eternal Sunshine of the Spotless Mind']
['Syndromes and a Century']
['Spirited Away']
['The Act of Killing']
['The Grand Budapest Hotel']
['The New World']
['Certified Copy']
['The World']
['Elephant']
-------------
['Zero Dark Thirty']
['A History of Violence']
['The Grand Budapest Hotel']
['Stories We Tell']
['Casino Royale']
['Eternal Sunshine of the Spotless Mind']
['Tabu']
['Snow White']
['Frozen River']
['Gosford Park']
-------------
['In the Mood for Love']
['Mulholland Drive']
['Inception']
["Pan's Labyrinth"]
['Caché']
['Grizzly Man']
['4 Months, 3 Weeks & 2 Days']
['Holy Motors']
['The Last of the Unjust']
['There Will Be Blood']
-------------
['The Piano Teacher']
['Margaret']
['American Psycho']
['4 Months, 3 Weeks & 2 Days']
['

['Mulholland Drive']
['Waking Life']
['In the Mood for Love']
['Moonrise Kingdom']
['25th Hour']
['Certified Copy']
['AI: Artificial Intelligence']
['Only Lovers Left Alive']
-------------
['Melancholia']
['Spirited Away']
['Beyond the Hills']
['Let the Right One In']
['Stories We Tell']
['Carol']
['The Death of Mr Lazarescu']
['The Master']
['Caché']
['All Divided Selves']
-------------
['The White Ribbon']
['Winter Sleep']
['Timbuktu']
['Mia Madre']
['The Return']
['Oasis']
['Bad Education']
['The Artist']
['Oldboy']
['The House of Mirth']
-------------
['Boyhood']
['Crouching Tiger, Hidden Dragon']
['Locke']
['The Hurt Locker']
["Pan's Labyrinth"]
['The Lives of Others']
['Sin Nombre']
['Nebraska']
['Fruitvale Station']
['Man on Wire']
-------------
['A Separation']
['Babel']
['Spring, Summer, Fall, Winter…and Spring']
['In the Mood for Love']
['Spirited Away']
['Brokeback Mountain']
['Oldboy']
['Ten']
['Russian Ark']
['Once Upon a Time in Anatolia']
-------------
['Goodbye to Langu

**STEP 7**
This is the final step of the hardest part! If you make it all the way to the end of this let me know and we can discuss what to do next. If you've made it just following instructions, you are in great shape for the rest of this project--if not, don't worry! I will get you through by midweek.

The final step is building a list of lists of all this information.

So you need to have a loop that gets everything out, but you also need to figure out **how you want to organize what you're pulling out.** What should a row look like in your table?
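One possible answer, sketched with hard-coded stand-in data so it runs without scraping: make each row one movie pick, with the critic's fields repeated on every row. The `scraped` structure is invented for this illustration; in your real loop the values come from your regular expressions.

```python
# Stand-in for the scraped page: (critic fields, list of movie fields).
scraped = [
    (("Simon Abrams", "Freelance film critic", "US"),
     [("Mulholland Drive", "David Lynch", "2001"),
      ("In the Mood for Love", "Wong Kar-wai", "2000")]),
    (("Sam Adams", "Freelance film critic", "US"),
     [("In the Mood for Love", "Wong Kar-wai", "2000")]),
]

rows = []
for (name, org, country), movies in scraped:
    for title, director, year in movies:
        # Start a fresh row for every movie, then append it exactly once.
        row = [name, org, country, title, director, year]
        rows.append(row)

print(len(rows))
print(rows[0])
```

The detail to watch is where the row is created: if the same list is reused across movies and appended repeatedly, you end up with one long flat list instead of separate rows.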


In the cell below, I give you a final architecture you need to use to get this most challenging list of lists.

In [None]:
#figure out how you're going to collect your clean information
list_of_what = []
#for loop that goes through all the <p> elements
    #if strong (begins with the critic)
        #critic_info= get the critic line
        #critic_name = re.findall(regex,critic_info)
        #critic_org = re.findall(regex,critic_info)
        #critic_cn = re.findall(regex,critic_info)
        #movie_info = get movie line using next_sibling
        #get each movie string
        #loop through each movie_line (#1 through #10)
            #movie_rank = re.findall(regex,movie_line)
            #movie_name = re.findall(regex,movie_line)
            #movie_dir = re.findall(regex,movie_line)
            #movie_year = re.findall(regex,movie_line)
            #this will happen 10 times

#You will want to build a list that gets appended to list_of_what
#Try to figure out how you want to append things
#That is, how you want to organize your data

    

In [20]:
import re
list_of_all = []
for lines in all_p[3:-5]:
    if lines.strong is not None:
        critic_info = lines.find('strong').string
        regex_for_name = r"^\w+[-]?\s?\w+[-]?\s?\w+"
        regex_for_org =  r"\w\s[^-]\s(\w.*) [()]"
        regex_for_cn = r"[(](\w+\s?\w+)[)]"
        try:
            critics_name = re.findall(regex_for_name, critic_info)[0]
            critics_org = re.findall(regex_for_org, critic_info)[0]
            critics_cn = re.findall(regex_for_cn, critic_info)[0]
        except:
            pass
        movie_info = lines.find_next_sibling()      
        each_movie = movie_info.find_all(string=True)
        #list_of_all.append(critics_name)
        #list_of_all.append(critics_org)
        #list_of_all.append(critics_cn)
           
        list_of_movie = []
        for movie_list in each_movie:
            movie = movie_list
            regex_for_rank =r"^\d+"
            regex_for_mname = r"\d. (\w.*)\s[(]"
            regex_for_dir = r"[(](\w.*),"
            regex_for_year = r", (\d+)[)]"
            try:
                movie_rank = re.findall(regex_for_rank,movie)[0]
                movie_name = re.findall(regex_for_mname,movie)[0]
                movie_dir = re.findall(regex_for_dir,movie)[0]
                movie_year = re.findall(regex_for_year,movie)[0]
    #             list_of_movie.append(movie_rank)

                list_of_movie.append(critics_name)
                list_of_movie.append(critics_cn)
                list_of_movie.append(critics_org)
                list_of_movie.append(movie_name)            
                list_of_movie.append(movie_dir)                       
                list_of_movie.append(movie_year)
            except:
                pass
#             print(movie_rank)
#             print(movie_name)
#             print(movie_dir)
#             print(movie_year)
            list_of_all.append(list_of_movie)
#     print("-----------")
    #list_of_all.append(list_of_movie)


In [21]:
##Take a peek at your final lists of lists

# list_of_all
list_of_all

[['Simon Abrams',
  'US',
  'Freelance film critic',
  'Mulholland Drive',
  'David Lynch',
  '2001',
  'Simon Abrams',
  'US',
  'Freelance film critic',
  'In the Mood for Love',
  'Wong Kar-wai',
  '2000',
  'Simon Abrams',
  'US',
  'Freelance film critic',
  'The Tree of Life',
  'Terrence Malick',
  '2011',
  'Simon Abrams',
  'US',
  'Freelance film critic',
  'Yi Yi: A One and a Two',
  'Edward Yang',
  '2000',
  'Simon Abrams',
  'US',
  'Freelance film critic',
  'Goodbye to Language',
  'Jean-Luc Godard',
  '2014',
  'Simon Abrams',
  'US',
  'Freelance film critic',
  'The White Meadows',
  'Mohammad Rasoulof',
  '2009',
  'Simon Abrams',
  'US',
  'Freelance film critic',
  'Night Across the Street',
  'Raoul Ruiz',
  '2012',
  'Simon Abrams',
  'US',
  'Freelance film critic',
  'Certified Copy',
  'Abbas Kiarostami',
  '2010',
  'Simon Abrams',
  'US',
  'Freelance film critic',
  'Sparrow',
  'Johnnie To',
  '2008',
  'Simon Abrams',
  'US',
  'Freelance film critic',
 

If you made it this far, congratulations!

You can go ahead and try to build the list of movies and/or the list of directors on your own--they will use similar logic, but they will not be nearly as complicated as this one.
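As a hint for those smaller lists, here is a minimal sketch that derives them from per-vote rows with a set. The three sample rows are a hypothetical stand-in for your full data, using the key names `movie_name`, `movie_dir`, and `movie_year`.

```python
# Hypothetical sample of per-vote rows in dict form.
rows = [
    {"movie_name": "Mulholland Drive", "movie_dir": "David Lynch", "movie_year": "2001"},
    {"movie_name": "In the Mood for Love", "movie_dir": "Wong Kar-wai", "movie_year": "2000"},
    {"movie_name": "In the Mood for Love", "movie_dir": "Wong Kar-wai", "movie_year": "2000"},
]

# A set keeps each (title, director, year) combination only once.
unique_movies = sorted({(r["movie_name"], r["movie_dir"], r["movie_year"]) for r in rows})
unique_directors = sorted({r["movie_dir"] for r in rows})

print(unique_movies)
print(unique_directors)
```

Using the title together with the director and year as the key is a cautious choice: two different films can share a title.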

In [23]:
list_of_movie = []
for lines in all_p[3:]:
    if lines.strong is not None:
        critic_info = lines.find('strong').string
        regex_for_name = r"^\w+\s\w+ "
        regex_for_org =  r"\w\s[^-]\s(\w.*) [()]"
        regex_for_cn = r"[(](\w+)[)]"
        try:
            critics_name = re.findall(regex_for_name, critic_info)[0]
            critics_org = re.findall(regex_for_org, critic_info)[0]
            critics_cn = re.findall(regex_for_cn, critic_info)[0]
        except:
            pass
        movie_info = lines.find_next_sibling()      
        each_movie = movie_info.find_all(string=True)
        
        for movie_list in each_movie:
            movie_dict = {}
            movie = movie_list
            regex_for_rank =r"^\d+"
            regex_for_mname = r"\d. (\w.*)\s[(]"
            regex_for_dir = r"[(](\w.*),"
            regex_for_year = r", (\d+)[)]" 
            movie_rank = re.findall(regex_for_rank,movie)
            try:
                movie_dict['critics_name'] = critics_name
                movie_dict['critics_org'] = critics_org
                movie_dict['critics_cn'] = critics_cn
                movie_dict['movie_name'] = re.findall(regex_for_mname,movie)[0]
                movie_dict['movie_dir'] = re.findall(regex_for_dir,movie)[0]
                movie_dict['movie_year'] = re.findall(regex_for_year,movie)[0]
            except:
                pass

        
            list_of_movie.append(movie_dict)
    
list_of_movie[-1]

{'critics_name': 'More on ',
 'critics_org': 'China Daily',
 'critics_cn': 'China'}

In [24]:
df_movie = pd.DataFrame(list_of_movie)

df_movie = df_movie[['critics_name', 'critics_org', 'critics_cn', 'movie_name', 'movie_dir', 'movie_year']]
df_movie.head(10)

| | critics_name | critics_org | critics_cn | movie_name | movie_dir | movie_year |
|---|---|---|---|---|---|---|
| 0 | Simon Abrams | Freelance film critic | US | Mulholland Drive | David Lynch | 2001 |
| 1 | Simon Abrams | Freelance film critic | US | In the Mood for Love | Wong Kar-wai | 2000 |
| 2 | Simon Abrams | Freelance film critic | US | The Tree of Life | Terrence Malick | 2011 |
| 3 | Simon Abrams | Freelance film critic | US | Yi Yi: A One and a Two | Edward Yang | 2000 |
| 4 | Simon Abrams | Freelance film critic | US | Goodbye to Language | Jean-Luc Godard | 2014 |
| 5 | Simon Abrams | Freelance film critic | US | The White Meadows | Mohammad Rasoulof | 2009 |
| 6 | Simon Abrams | Freelance film critic | US | Night Across the Street | Raoul Ruiz | 2012 |
| 7 | Simon Abrams | Freelance film critic | US | Certified Copy | Abbas Kiarostami | 2010 |
| 8 | Simon Abrams | Freelance film critic | US | Sparrow | Johnnie To | 2008 |
| 9 | Simon Abrams | Freelance film critic | US | Fados | Carlos Saura | 2007 |


In [25]:
df_movie.dropna(subset = ['movie_name'], inplace = True)


In [26]:
df_movie.to_csv("df_movie_all.csv", index=False )

In [27]:
#What year had the most movies selected?
df_movie.movie_year.value_counts().head(5)

2000    153
2001    145
2007    133
2014    130
2011    126
Name: movie_year, dtype: int64

In [28]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait


In [29]:
###function
import time
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

def get_country_info(row):
    try:
        driver.get("https://www.imdb.com/")
        time.sleep(1)
        #Type the director's name into the IMDb search box
        text_input = driver.find_element(By.ID, "navbar-query")
        time.sleep(1)
        text_input.send_keys(row['movie_dir'])
        time.sleep(1)
        #Click the first search suggestion to open the director's page
        info = driver.find_element(By.CLASS_NAME, "suggestionlabel")
        time.sleep(1)
        info.click()
        time.sleep(1)
        country = driver.find_element(By.ID, 'name-born-info').text

        print("This director is from", country)

        return pd.Series({
            'dir_cn': country
        })
    except Exception:
        #If any lookup fails, return an empty row so apply() keeps going
        return pd.Series({})

In [30]:
new_df_movie = df_movie.head(2).apply(get_country_info, axis=1).join(df_movie)
new_df_movie = new_df_movie[['critics_name', 'critics_org', 'critics_cn', 'movie_name', 'movie_dir', 'dir_cn', 'movie_year']]
new_df_movie

This director is from Born: June 22, 1947 in Brooklyn, New York, USA
This director is from Born: January 20, 1946 in Missoula, Montana, USA
This director is from Born: July 17, 1956 in Shanghai, China


| | critics_name | critics_org | critics_cn | movie_name | movie_dir | dir_cn | movie_year |
|---|---|---|---|---|---|---|---|
| 0 | Simon Abrams | Freelance film critic | US | Mulholland Drive | David Lynch | Born: January 20, 1946 in Missoula, Montana, USA | 2001 |
| 1 | Simon Abrams | Freelance film critic | US | In the Mood for Love | Wong Kar-wai | Born: July 17, 1956 in Shanghai, China | 2000 |
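The `dir_cn` values in this output still hold the whole "Born: ..." string rather than just a country. A minimal sketch for trimming it, assuming the country is always the text after the last comma (`birth_country` is a hypothetical helper, and that assumption will not hold for entries that list no birthplace):

```python
def birth_country(born_text):
    """Return the text after the last comma, or None if there is no comma."""
    if "," not in born_text:
        return None
    return born_text.rsplit(",", 1)[1].strip()

print(birth_country("Born: January 20, 1946 in Missoula, Montana, USA"))
print(birth_country("Born: July 17, 1956 in Shanghai, China"))
```

It is worth spot-checking the results by eye, since a string with a date but no place would return the year instead of a country.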


In [31]:
#Keep one row per (director, movie) pair, so every count below is 1
df_duplicates = df_movie.drop_duplicates(subset=['movie_dir', 'movie_name'])
df_3 = df_duplicates.groupby(['movie_dir', 'movie_name']).size().sort_values(ascending=False).reset_index()
df_3

| | movie_dir | movie_name | 0 |
|---|---|---|---|
| 0 | Éric Rohmer | Triple Agent | 1 |
| 1 | Hayao Miyazaki | The Wind Rises | 1 |
| 2 | Harmony Korine | Spring Breakers | 1 |
| 3 | Hany Abu-Assad | Paradise Now | 1 |
| 4 | Guy Maddin | My Winnipeg | 1 |
| 5 | Gus Van Sant | Gerry | 1 |
| 6 | Gus Van Sant | Elephant | 1 |
| 7 | Gurinder Chadha | Bend It Like Beckham | 1 |
| 8 | Guillermo Del Toro | Pan's Labyrinth | 1 |
| 9 | Greg Mottola | Adventureland | 1 |


In [32]:
df_movie_50 = df_movie.groupby(['movie_dir', 'movie_name']).size().sort_values(ascending= False).head(50).reset_index().rename(columns={0:'votes'})
df_movie_50

| | movie_dir | movie_name | votes |
|---|---|---|---|
| 0 | Wong Kar-wai | In the Mood for Love | 49 |
| 1 | David Lynch | Mulholland Drive | 47 |
| 2 | Paul Thomas Anderson | There Will Be Blood | 35 |
| 3 | Hayao Miyazaki | Spirited Away | 34 |
| 4 | Richard Linklater | Boyhood | 30 |
| 5 | Michel Gondry | Eternal Sunshine of the Spotless Mind | 29 |
| 6 | Asghar Farhadi | A Separation | 28 |
| 7 | Terrence Malick | The Tree of Life | 23 |
| 8 | Edward Yang | Yi Yi: A One and a Two | 22 |
| 9 | Joel and Ethan Coen | No Country For Old Men | 21 |


In [50]:
#Director with most total movies on the list
df_movie.groupby(['movie_dir'])['movie_name'].nunique().sort_values(ascending= False).head(20)

movie_dir
Quentin Tarantino            6
Clint Eastwood               5
Nuri Bilge Ceylan            5
Tsai Ming-liang              5
Lars von Trier               4
Martin Scorsese              4
Kenneth Lonergan             4
Michael Haneke               4
Joel and Ethan Coen          4
Steven Spielberg             4
Jean-Luc Godard              4
Jafar Panahi                 4
Hou Hsiao-hsien              4
David Fincher                4
Jia Zhangke                  4
Paul Thomas Anderson         4
Christopher Nolan            4
Wes Anderson                 4
Apichatpong Weerasethakul    4
Ang Lee                      4
Name: movie_name, dtype: int64

In [32]:
#Which directors have the most movies selected?
df_movie['movie_dir'].value_counts()

Joel and Ethan Coen                                                                         52
Wong Kar-wai                                                                                51
Paul Thomas Anderson                                                                        51
David Lynch                                                                                 48
Richard Linklater                                                                           39
Hayao Miyazaki                                                                              35
Michael Haneke                                                                              35
Terrence Malick                                                                             32
David Fincher                                                                               31
Asghar Farhadi                                                                              31
Michel Gondry                                     

In [38]:
df_movie['movie_name'].value_counts()

In the Mood for Love                     49
Mulholland Drive                         47
There Will Be Blood                      35
Spirited Away                            34
Boyhood                                  30
Eternal Sunshine of the Spotless Mind    29
A Separation                             28
The Tree of Life                         23
Yi Yi: A One and a Two                   22
No Country For Old Men                   21
Inside Llewyn Davis                      20
Children of Men                          18
4 Months, 3 Weeks & 2 Days               17
Pan's Labyrinth                          17
The Act of Killing                       16
Holy Motors                              16
Zodiac                                   15
Mad Max: Fury Road                       14
The White Ribbon                         13
Talk to Her                              13
Caché                                    13
The Social Network                       13
25th Hour                       

In [135]:
# df_movie_50.to_csv("df_movie_50.csv", index=False )