## BBC Movie List: Scraping and Regex

In 2016 the BBC polled 177 film critics to get their picks for the best films of the century so far. While the BBC's [aggregate poll](http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films) is interesting, the long list including everyone who voted is perhaps more revealing from a data standpoint:

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

How do you wrangle this data? That is the central challenge you'll be dealing with this week. The HTML page on the BBC site (mirrored on my site) poses a number of challenges. The layout is relatively simple and consistent, but that simplicity actually makes it a little more challenging, because there aren't many HTML tags to help you isolate each unit of data. You can use Beautiful Soup to isolate the line that contains all the information for a critic, and you can isolate each group of top 10 movies as well. You need to use Beautiful Soup to find each critic, as well as the list of movies that immediately follows them, and then use regular expressions to divide the critic information and the movie info to create the most useful possible data structure. What should the data structure be? That is up to you to figure out.



### Getting started: Data Architecture

The central challenge of this assignment is figuring out how you are going to set up your table (a list of dictionaries) from this long list of critics and movies. What will each row be? What will the columns be in each row? How can you set it up so that you have the most useful table possible?

Some things to think about: what are the main categories of analysis? Try to design a schema that will give you a table that you can run solid queries on.

We will eventually bring this into Python's pandas library (table format), so you want to keep your table as simple and structured as possible. Try to think about how you can transform the main source into one large table that can be aggregated and grouped.
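
One possible shape, offered only as a sketch and not as the required answer: one dictionary per vote, so each row repeats the critic's details next to a single movie. The column names and the sample rows below are made up for illustration. The point is that a flat list like this can be counted and grouped directly.

```python
from collections import Counter

# Hypothetical rows: one dictionary per (critic, movie) vote.
# The column names are an assumption; pick whatever suits your analysis.
rows = [
    {"critic": "Critic One", "org": "Paper A", "country": "US",
     "rank": 1, "title": "Film A", "director": "Director A", "year": 2001},
    {"critic": "Critic One", "org": "Paper A", "country": "US",
     "rank": 2, "title": "Film B", "director": "Director B", "year": 2005},
    {"critic": "Critic Two", "org": "Paper B", "country": "UK",
     "rank": 1, "title": "Film A", "director": "Director A", "year": 2001},
]

# Because every row has the same keys, simple aggregations are one-liners.
votes_per_title = Counter(row["title"] for row in rows)
votes_per_country = Counter(row["country"] for row in rows)

print(votes_per_title.most_common(1))  # [('Film A', 2)]
print(votes_per_country)
```

A list of same-keyed dictionaries like this is also something `pandas.DataFrame` accepts directly, which keeps the later move into pandas simple.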

### Ready to code?

The first thing you need to do is import Beautiful Soup and requests, like we did in the homework, and scrape the page.

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

It's also mirrored on my site just in case (but try to use the main link):
https://floatingmedia.com/DataClass/BBC.html
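
If you want the mirror to kick in automatically, here is a minimal sketch of one way to do it. The helper name `fetch_first_available` and the 10-second timeout are assumptions made for illustration, and the mirror's markup may not be identical to the live page, so check your selectors if the fallback is used.

```python
import requests

# Main BBC page first, then the mirror.
URLS = [
    "https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted",
    "https://floatingmedia.com/DataClass/BBC.html",
]

def fetch_first_available(urls, timeout=10):
    """Return the raw HTML bytes of the first URL that responds successfully."""
    for url in urls:
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()  # treat 4xx/5xx responses as failures
            return response.content
        except requests.RequestException as error:
            print(f"Could not fetch {url}: {error}")
    raise RuntimeError("None of the URLs could be fetched")

# Usage (makes a real network request, so it is left commented out here):
# html = fetch_first_available(URLS)
```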

Okay let's begin!

**STEP 1**


In [224]:
##Import your libraries: Beautiful soup, requests, and re (For regular expressions)
import re
import requests
from bs4 import BeautifulSoup


In [4]:
# read the URL, and put the HTML page into beautiful soup

url = "https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted"
html = requests.get(url).content
doc = BeautifulSoup(html,'html.parser')

In [None]:
#Using beautiful soup find the tag that contains 
#the entire list of critics and movies
#Make a variable (like all_info) that holds all that information 
all_info = doc.find_all("div", attrs={'data-component':'text-block'})[2:-3]
all_info

**STEP 2** Using Beautiful Soup figure out how to separate the entries.

In [None]:
#find_all
#first critic 
critic_name = all_info[0].find('p').text
print ("This is the first critic: " + critic_name)

#first movie (index 0 is the first <p> in the movie block)
movie_name =  all_info[1].find_all('p')[0].text
print ("This is the first movie in the first critic list: " + movie_name)


**STEP 3** This is where all the magic has to happen: you need to find a way to loop through all of the elements (the list you just got from find_all()) and pull out the critics and their lists of movies.

Critics should not be too hard. But in order to get the movies attached to that critic you need to be smart about your beautiful soup method.

As you go through this loop print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by movie line's HTML--you've got it!



In [None]:
##Write your loop for STEP 3 here
critic_info = []
movie_info  = []

##To avoid a multitude of loops, I use a single range(len(...)) loop with a step of 2.
for x in range(0, len(all_info),2):
    ##CRITICS
    if x <= 144:
        #print(x,all_info[x])
        #print(x,all_info[x].find('b').text)
        critic_name = all_info[x].find('b').text
        critic_info.append(critic_name)
    if x >= 146:
        #print (x,all_info[x-1])
        #print(x,all_info[x-1].find('b').text)
        critic_name = all_info[x-1].find('b').text
        critic_info.append(critic_name)
    
    ##MOVIES
    if x <= 143:
        first_movies_list = all_info[x+1].find_all('p')
        #print(x,all_info[x+1].find_all('p'))
        for j in first_movies_list:
            first_movie = j.text
            #print("this is movie",movie)
            movie_info.append(first_movie)

    elif x >= 145:
        #print (x,all_info[x].find_all('p'))
        second_movies_list = all_info[x].find_all('p')
        for j in second_movies_list:
            second_movie = j.text
            #print("Next movie List",movie)
            movie_info.append(second_movie)
    
    if x == 144:
        # Block 144 appears to hold the critic line and the movies together,
        # so skip the first <p> (the critic) and keep the rest as movies.
        broken_list = all_info[x].find_all('p')
        for j in broken_list[1:]:
            movie_info.append(j.text)
critic_info
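
The loop above hard-codes positions such as 144 and 146 to cope with the one block where the critic and the movies share a container. As a hedged alternative, the sketch below walks every `<p>` in order and remembers the most recent critic it saw. It assumes critic lines are the only paragraphs containing a `<b>` tag, which the `.find('b')` calls above suggest but which is not verified here, and it runs on a made-up HTML fragment, not the real page.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the page: the third block mixes a critic
# line and a movie line in the same <div>.
sample_html = """
<div data-component="text-block"><p><b>Critic One – Paper A (US)</b></p></div>
<div data-component="text-block"><p>1. Film A (Director A, 2001)</p><p>2. Film B (Director B, 2005)</p></div>
<div data-component="text-block"><p><b>Critic Two – Paper B (UK)</b></p><p>1. Film C (Director C, 2003)</p></div>
"""

doc = BeautifulSoup(sample_html, "html.parser")
blocks = doc.find_all("div", attrs={"data-component": "text-block"})

pairs = []             # (critic line, movie line) tuples
current_critic = None  # most recent critic line seen so far

for block in blocks:
    for p in block.find_all("p"):
        if p.find("b") is not None:
            # A bold paragraph starts a new critic.
            current_critic = p.get_text(strip=True)
        elif current_critic is not None:
            # Any other paragraph belongs to the current critic.
            pairs.append((current_critic, p.get_text(strip=True)))

for critic_line, movie_line in pairs:
    print(critic_line, "|", movie_line)
```

Because the critic is tracked as state, it no longer matters whether the movies sit in the next block or in the same one.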

    

**STEP 4**
If your loop is successfully isolating those two lines, it's now time to parse each line with regular expressions. This needs to happen inside the loop: for every critic, and then (in STEP 5) for every movie. Here just **focus on getting the critic's name, organization, and country.**

Inside the loop, once you have critic_info, write a regular expression that pulls out the name of the critic and store it in a variable called critic_name:

`critic_name = re.findall(regex, critic_info)`

Do the same thing for critic_org and critic_cn.

As you go, print(critic_name), then print(critic_org), etc., to make sure you're getting the results. It might help, before you run all these regular expressions in a loop, to grab just one critic's line and test your regular expressions on it to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.

In [None]:
import re

# Note: this sample uses a plain hyphen; the loop below matches the en dash (–) instead.
crit_sample = "Arturo Aguilar - Rolling Stone Mexico (Mexico)"
regex_for_name = r"(^[\w ]+) \-+"
regex_for_org = r"\-+ ([\w ]+) \(+"
regex_for_cn = r"\(+([\w ]+)\)"


name = re.findall(regex_for_name, crit_sample)
org = re.findall(regex_for_org, crit_sample)
cn = re.findall(regex_for_cn, crit_sample)

print("Name:", name)
print("Organization:", org)
print("Country:", cn)
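
As a sketch of another way to go, the three separate patterns can be folded into one pattern with named groups, so a single match returns all three fields. The helper name `parse_critic` is hypothetical, and the pattern assumes each line looks like `Name <dash> Organization (Country)` with spaces around the dash; it accepts either a hyphen or an en dash.

```python
import re

# One pattern for the whole critic line; accepts a hyphen or an en dash
# as the separator between the name and the organization.
CRITIC_RE = re.compile(
    r"^\s*(?P<name>.+?)\s+[-–]\s+(?P<org>.+)\s+\((?P<country>[^()]+)\)\s*$"
)

def parse_critic(line):
    """Return a dict with name, org, and country, or None if there is no match."""
    match = CRITIC_RE.match(line)
    return match.groupdict() if match else None

print(parse_critic("Arturo Aguilar - Rolling Stone Mexico (Mexico)"))
# {'name': 'Arturo Aguilar', 'org': 'Rolling Stone Mexico', 'country': 'Mexico'}
print(parse_critic("not a critic line"))  # None
```

Returning `None` for lines that do not fit makes it easy to spot entries the pattern misses instead of silently storing empty strings.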


In [None]:
#Take your working loop from step three
#And put it here With the regular expression parsing inside it
for x in range(0, len(all_info),2):
    if x <= 144:
        critic_name = all_info[x].find('b').text
        name_regex = r"([\w ]+) \–+"
        org_regex = r"\–+ (.+) \(+"
        cn_regex = r"\(+([\w ]+)\)"
        name = re.findall(name_regex, critic_name)
        org = re.findall(org_regex, critic_name)
        cn = re.findall(cn_regex, critic_name)
        print (name,org,cn)
    if x >= 146:
        critic_name = all_info[x-1].find('b').text
        name_regex = r"([\w ]+) \–+"
        org_regex = r"\–+ (.+) \(+"
        cn_regex = r"\(+([\w ]+)\)"
        name = re.findall(name_regex, critic_name)
        org = re.findall(org_regex, critic_name)
        cn = re.findall(cn_regex, critic_name)
        print (name,org,cn)

**STEP 5**
Now you need to get your **movie info**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable -- which is each movie followed by a `<BR>` tag. See our old scraping homeworks for how to get a list of each movie entry--which will contain a string for each movie. Like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop, that goes to each movie and prints out each movie.
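
Here is a minimal sketch of that inner loop, run on a made-up fragment and not on the real page. It assumes the movies sit inside one tag as plain text separated by `<br>` tags; Beautiful Soup's `stripped_strings` then yields one string per movie. If a title contained nested tags such as `<i>`, the text would be split into more pieces, so treat this as a starting point.

```python
from bs4 import BeautifulSoup

# Made-up fragment: one critic line, then movies separated by <br> tags.
sample_html = """
<p><b>Critic One – Paper A (US)</b></p>
<p>1. Film A (Director A, 2001)<br>
2. Film B (Director B, 2005)<br></p>
"""

soup = BeautifulSoup(sample_html, "html.parser")
critic_info = soup.find("b").get_text(strip=True)
movie_info = soup.find_all("p")[1]

# stripped_strings yields each text node with surrounding whitespace removed,
# so the <br> tags effectively act as separators between movie entries.
each_movie = list(movie_info.stripped_strings)

# Inner loop: in the real notebook this sits inside the loop over critics.
for movie in each_movie:
    print(critic_info, "|", movie)
```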


In [None]:
##Take your working loop And add the find_all for each_movie
#And the inner loop that loops through each_movie

Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.


In [None]:
#Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
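mn_regex = r"\d+\. (.+) \("     # movie name: text between the rank and the last " ("
md_regex = r"\((.+),"           # director: text after the first "(" up to the last comma
mr_regex = r"\(.+, (\d+)\)"     # release year: digits just before the closing ")"

#what else should you extract???
movie_name = re.findall(mn_regex,movie_harder)
movie_director = re.findall(md_regex,movie_harder)
movie_rel_year = re.findall(mr_regex,movie_harder)
print(movie_name, movie_director, movie_rel_year)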
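
One answer to "what else should you extract" is the rank at the start of the line. The sketch below, offered as one option among several, captures rank, title, director, and year in a single pattern with named groups. The helper name `parse_movie` is hypothetical, and the pattern assumes every line ends with `(Director, YYYY)`. Excluding parentheses from the director group keeps titles that contain their own parentheses from confusing the match.

```python
import re

# rank. title (director, year) -- the director group may not contain
# parentheses, so only the final "(...)" is treated as director and year.
MOVIE_RE = re.compile(
    r"^\s*(?P<rank>\d+)\.\s+(?P<title>.+)\s+"
    r"\((?P<director>[^()]+),\s*(?P<year>\d{4})\)\s*$"
)

def parse_movie(line):
    """Return a dict with rank, title, director, and year, or None if no match."""
    match = MOVIE_RE.match(line)
    if match is None:
        return None
    movie = match.groupdict()
    movie["rank"] = int(movie["rank"])  # store numbers as integers
    movie["year"] = int(movie["year"])
    return movie

print(parse_movie("7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"))
# {'rank': 7, 'title': '4 Months, 3 Weeks & 2 Days', 'director': 'Cristian Mungiu', 'year': 2007}

# Made-up line with parentheses in the title:
print(parse_movie("3. Some Title (Part One) (Some Director, 2010)"))
```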


**STEP 6**
You're almost there!!! Now that you have working regular expressions, put them in your inner loop to get the movie name.

So now the entire loop should be getting critic information and movie information all separated as separate columns/properties.

Build this loop(s) using print() on the first one or two critic selections. Just to make sure you are pulling out the right data.




In [None]:
#Get that loop working here
for x in range(0, len(all_info),2):    
    ####CRITIC 
    if x <= 144:
        critic_name = all_info[x].find('b').text
        name_regex = r"([\w ]+) \–+"
        org_regex = r"\–+ (.+) \(+"
        cn_regex = r"\(+([\w ]+)\)"
        name = re.findall(name_regex, critic_name)
        org = re.findall(org_regex, critic_name)
        cn = re.findall(cn_regex, critic_name)
        print (name,org,cn)
    if x >= 146:
        critic_name = all_info[x-1].find('b').text
        name_regex = r"([\w ]+) \–+"
        org_regex = r"\–+ (.+) \(+"
        cn_regex = r"\(+([\w ]+)\)"
        name = re.findall(name_regex, critic_name)
        org = re.findall(org_regex, critic_name)
        cn = re.findall(cn_regex, critic_name)
        print (name,org,cn)    
    
    ####MOVIES
    if x <= 143:
        first_movies_list = all_info[x+1].find_all('p')
        #print(x,all_info[x+1].find_all('p'))
        for j in first_movies_list:
            first_movie = j.text
            mn_regex = r"\d+. (.+) \("
            md_regex = r"\((.+), "
            mr_regex = r"\(.+,* (\d+)\)"
            first_regex_mn = re.findall (mn_regex, first_movie)
            first_regex_md = re.findall (md_regex, first_movie)
            first_regex_mr = re.findall (mr_regex, first_movie)
            print(first_regex_mn,first_regex_md,first_regex_mr)

    if x == 144:
        #print(x,all_info[x].find_all('p'))
        broken_list = all_info[x].find_all('p')
        for j in range(len(broken_list)):
            if j > 0:
                b_m_names = broken_list[j].text
                mn_regex = r"\d+. (.+) \("
                md_regex = r"\((.+),"
                mr_regex = r"\(.+,* (\d+)\)"
                b_m_regex_mn = re.findall (mn_regex, b_m_names)
                b_m_regex_md = re.findall (md_regex, b_m_names)
                b_m_regex_mr = re.findall (mr_regex, b_m_names)
                print(b_m_regex_mn,b_m_regex_md,b_m_regex_mr)

    elif x >= 145:
        #print (x,all_info[x].find_all('p'))
        second_movies_list = all_info[x].find_all('p')
        for j in second_movies_list:
            second_movie = j.text
            mn_regex = r"\d+. (.+) \("
            md_regex = r"\((.+),"
            mr_regex = r"\(.+,* (\d+)\)"
            second_regex_mn = re.findall (mn_regex, second_movie)
            second_regex_md = re.findall (md_regex, second_movie)
            second_regex_mr = re.findall (mr_regex, second_movie)
            print(second_regex_mn,second_regex_md,second_regex_mr)
    


**STEP 7**
This is the final step of the hardest part! 

The final step is building a list of dictionaries of all this information.

So you need to have a loop that gets everything out, but you also need to figure out **how you want to organize what you're pulling out.** What should a row look like in your table?




In [507]:
#figure out how you're going to collect your clean information
list_of_what = []
test_list = []

#loop through the beautiful soup elements
#and use the regexes you developed above to get each unit of info


#Try to figure out how you want to append things
#That is, how you want to organize your data

#Get that loop working here
for x in range(0, len(all_info),2):    
    ####CRITIC 
    if x <= 144:
        critic_name = all_info[x].find('b').text
        name_regex = r"([\w ]+) \–+"
        org_regex = r"\–+ (.+) \(+"
        cn_regex = r"\(+([\w ]+)\)"
        name = re.findall(name_regex, critic_name)
        name_ = " ".join(name)
        org = re.findall(org_regex, critic_name)
        org_ = " ".join(org)
        cn = re.findall(cn_regex, critic_name)
        cn_ = " ".join(cn)
        critic_name_dic = {
            "Critic Name": name_,
            "Critic Organization": org_,
            "Critic Country": cn_
        }
        test_list.append(critic_name_dic)

    elif x >= 146:
        critic_name = all_info[x-1].find('b').text
        name_regex = r"([\w ]+) \–+"
        org_regex = r"\–+ (.+) \(+"
        cn_regex = r"\(+([\w ]+)\)"
        name = re.findall(name_regex, critic_name)
        name_ = " ".join(name)
        org = re.findall(org_regex, critic_name)
        org_ = " ".join(org)
        cn = re.findall(cn_regex, critic_name)
        cn_ = " ".join(cn)
        critic_name_sec_dic = {
            "Critic Name": name_,
            "Critic Organization": org_,
            "Critic Country": cn_
        }
        test_list.append(critic_name_sec_dic)
        
    ####MOVIES
    if x <= 143:
        first_movies_list = all_info[x+1].find_all('p')
        #print(x,all_info[x+1].find_all('p'))
        for j in first_movies_list:
            first_movie = j.text
            mn_regex = r"\d+. (.+) \("
            md_regex = r"\((.+), "
            mr_regex = r"\(.+,* (\d+)\)"
            first_regex_mn = re.findall (mn_regex, first_movie)
            first_regex_mn_ = " ".join(first_regex_mn)
            first_regex_md = re.findall (md_regex, first_movie)
            first_regex_md_ = " ".join(first_regex_md)
            first_regex_mr = re.findall (mr_regex, first_movie)
            first_regex_mr_ = " ".join(first_regex_mr)
            movies_dic_items = {
                "Movie Name": first_regex_mn_,
                "Movie Director": first_regex_md_,
                "Movie Release Year": first_regex_mr_
            }
            test_list.append(movies_dic_items)

        # print(first_regex_mn,first_regex_md,first_regex_mr)

    elif x == 144:
        #print(x,all_info[x].find_all('p'))
        broken_list = all_info[x].find_all('p')
        for j in range(len(broken_list)):
            if j > 0:
                b_m_names = broken_list[j].text
                mn_regex = r"\d+. (.+) \("
                mr_regex = r"\(.+,* (\d+)\)"
                md_regex = r"\((.+),"
                b_m_regex_mn = re.findall (mn_regex, b_m_names)
                b_m_regex_mn_ = " ".join(b_m_regex_mn)
                b_m_regex_md = re.findall (md_regex, b_m_names)
                b_m_regex_md_ = " ".join(b_m_regex_md)
                b_m_regex_mr = re.findall (mr_regex, b_m_names)
                b_m_regex_mr_ = " ".join(b_m_regex_mr)
                movies_dic_fixed = {
                    "Movie Name": b_m_regex_mn_,
                    "Movie Director": b_m_regex_md_,
                    "Movie Release Year": b_m_regex_mr_
                }
                test_list.append(movies_dic_fixed)
                
    elif x >= 145:
        #print (x,all_info[x].find_all('p'))
        second_movies_list = all_info[x].find_all('p')
        for j in second_movies_list:
            second_movie = j.text
            mn_regex = r"\d+. (.+) \("
            md_regex = r"\((.+),"
            mr_regex = r"\(.+,* (\d+)\)"
            second_regex_mn = re.findall (mn_regex, second_movie)
            second_regex_mn_ = " ".join(second_regex_mn)
            second_regex_md = re.findall (md_regex, second_movie)
            second_regex_md_ = " ".join(second_regex_md)
            second_regex_mr = re.findall (mr_regex, second_movie)
            second_regex_mr_ = " ".join(second_regex_mr)
            second_movies_dic = {
                "Movie Name": second_regex_mn_,
                "Movie Director": second_regex_md_,
                "Movie Release Year": second_regex_mr_
            }
            test_list.append(second_movies_dic)

#list_of_what
test_list


[{'Critic Name': 'Simon Abrams',
  'Critic Organization': 'Freelance film critic',
  'Critic Country': 'US'},
 {'Movie Name': 'Mulholland Drive',
  'Movie Director': 'David Lynch',
  'Movie Release Year': '2001'},
 {'Movie Name': 'In the Mood for Love',
  'Movie Director': 'Wong Kar-wai',
  'Movie Release Year': '2000'},
 {'Movie Name': 'The Tree of Life',
  'Movie Director': 'Terrence Malick',
  'Movie Release Year': '2011'},
 {'Movie Name': 'Yi Yi: A One and a Two',
  'Movie Director': 'Edward Yang',
  'Movie Release Year': '2000'},
 {'Movie Name': 'Goodbye to Language',
  'Movie Director': 'Jean-Luc Godard',
  'Movie Release Year': '2014'},
 {'Movie Name': 'The White Meadows',
  'Movie Director': 'Mohammad Rasoulof',
  'Movie Release Year': '2009'},
 {'Movie Name': 'Night Across the Street',
  'Movie Director': 'Raoul Ruiz',
  'Movie Release Year': '2012'},
 {'Movie Name': 'Certified Copy',
  'Movie Director': 'Abbas Kiarostami',
  'Movie Release Year': '2010'},
 {'Movie Name': 'Spa

In [None]:
##Take a peek at your final list of dictionaries
test_list
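
The list built above keeps critics and movies as separate dictionaries, one after another. If you would rather have one row per vote, a rough sketch of the merge step looks like this. The `parsed` input is hypothetical sample data that reuses the same keys, and the `Rank` column assumes the movies appear in ranked order.

```python
# Hypothetical input: each critic dict paired with that critic's movie dicts.
parsed = [
    (
        {"Critic Name": "Critic One", "Critic Organization": "Paper A", "Critic Country": "US"},
        [
            {"Movie Name": "Film A", "Movie Director": "Director A", "Movie Release Year": "2001"},
            {"Movie Name": "Film B", "Movie Director": "Director B", "Movie Release Year": "2005"},
        ],
    ),
    (
        {"Critic Name": "Critic Two", "Critic Organization": "Paper B", "Critic Country": "UK"},
        [
            {"Movie Name": "Film A", "Movie Director": "Director A", "Movie Release Year": "2001"},
        ],
    ),
]

rows = []
for critic, movies in parsed:
    for rank, movie in enumerate(movies, start=1):
        # Merge the two dicts so every row carries the critic and the movie.
        rows.append({**critic, "Rank": rank, **movie})

for row in rows:
    print(row)
```

Each row now has the same set of keys, which is the shape that groups and aggregates cleanly once the data is loaded into a table.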

If you made it this far, yay!
