## BBC Movie List: Scraping and Regex

In 2016 the BBC polled 177 film critics to get their picks for the best films of the century so far. While the BBC's [aggregate poll](http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films) is interesting, the long list including everyone who voted is perhaps more revealing from a data standpoint:

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

How do I wrangle this data? That is the central challenge that you'll be dealing with this week. The HTML page on the BBC site (mirrored on my site) poses a number of challenges. The layout is relatively simple and consistent, but that simplicity actually makes it a little more challenging, because there aren't many HTML tags to help you isolate each unit of data. You can use Beautiful Soup to isolate the line that contains all the information for a critic, and you can isolate each group of top 10 movies as well. You need to use Beautiful Soup to find each critic, as well as the list of movies that immediately follows them, and then use regular expressions to divide the critic information and the movie info to create the most useful possible data structure. What should the data structure be? That is up to you to figure out.



### Getting started: Data Architecture

The central challenge of this assignment is figuring out how you are going to set up your table (list of dictionaries) from this long list of critics and movies. What will each row be? What will the columns be in each row? How can you set it up so that you have the most useful table possible?

Some things to think about: what are the main categories of analysis? Try to design a schema that will give you a table that you can run solid queries on.

We will eventually bring this into Python's pandas library (table format), so you want to keep your table as simple and structured as possible. Try to think about how you can transform the main source into one large table that can be aggregated and grouped.
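
To make the idea of "one large table" concrete, here is a minimal sketch of one possible layout: one row per pick, so a critic's fields repeat on every movie they chose. The key names and the `rank` column are assumptions for illustration, not a required schema; the three rows are typed in by hand from picks that appear on the page.

```python
from collections import Counter

# One row per (critic, movie) pick. Key names are illustrative assumptions.
rows = [
    {"critic_name": "Simon Abrams", "critic_org": "Freelance film critic", "critic_cn": "US",
     "rank": 1, "movie_name": "Mulholland Drive", "movie_director": "David Lynch", "movie_year": 2001},
    {"critic_name": "Simon Abrams", "critic_org": "Freelance film critic", "critic_cn": "US",
     "rank": 2, "movie_name": "In the Mood for Love", "movie_director": "Wong Kar-wai", "movie_year": 2000},
    {"critic_name": "Sam Adams", "critic_org": "Freelance film critic", "critic_cn": "US",
     "rank": 1, "movie_name": "In the Mood for Love", "movie_director": "Wong Kar-wai", "movie_year": 2000},
]

# A flat table makes counting and grouping straightforward.
votes_per_movie = Counter(row["movie_name"] for row in rows)
print(votes_per_movie.most_common(1))  # [('In the Mood for Love', 2)]
```

Because every row has the same keys, a list like this should load into a table without reshaping. Other layouts (for example, one row per critic with a nested list of movies) are possible but tend to be harder to group and aggregate.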

### Ready to code?

The first thing you need to do is import beautiful soup & requests like we did in the homework, and scrape the page. 

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

It's also mirrored on my site just in case (but try to use the main link):
https://floatingmedia.com/DataClass/BBC.html

Okay let's begin!

STEP 1:


In [15]:
##Import your libraries: Beautiful soup, requests, and re (For regular expressions)
import re

import requests
from bs4 import BeautifulSoup


In [16]:
# read the URL, and put the HTML page into beautiful soup

url = "https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted"
html = requests.get(url).content
doc = BeautifulSoup(html,'html.parser')

In [355]:
#Using beautiful soup find the tag that contains 
#the entire list of critics and movies
#Make a variable (like all_info) that holds all that information 
all_info = doc.find_all("div", attrs={'data-component':'text-block'})[2:-3]


**STEP 2** Using Beautiful Soup figure out how to separate the entries.

In [356]:
#find_all
#first critic 
critic_name = all_info[0].find('p').text
print("This is the first critic: " + critic_name)

#first movie (index 0 is the first <p> in the critic's movie block)
movie_name = all_info[1].find_all('p')[0].text
print("This is the first movie in the first critic list: " + movie_name)


This is the first critic: Simon Abrams – Freelance film critic (US)
This is the first movie in the first critic list: 1. Mulholland Drive (David Lynch, 2001)


**STEP 3** This is where all the magic has to happen: you need to find a way to loop through all of the elements (loop through the list you just got from the find_all()) and pull out the critics and their lists of movies.

Critics should not be too hard. But in order to get the movies attached to that critic you need to be smart about your beautiful soup method.

As you go through this loop print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by movie line's HTML--you've got it!
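
Stepping through the list two items at a time works only if critic blocks and movie blocks strictly alternate. As a hedged alternative, the sketch below decides what each line is from its content: a line that starts with a number and a period is treated as a movie, anything else as a new critic. The `lines` list is a hypothetical stand-in for the text of each `<p>` tag in page order, and `group_by_critic` is an illustrative helper name, not part of any library.

```python
import re

# Hypothetical stand-in for the text of each <p> tag, in page order.
lines = [
    "Simon Abrams – Freelance film critic (US)",
    "1. Mulholland Drive (David Lynch, 2001)",
    "2. In the Mood for Love (Wong Kar-wai, 2000)",
    "Sam Adams – Freelance film critic (US)",
    "1. In the Mood for Love (Wong Kar-wai, 2000)",
]

def group_by_critic(lines):
    """Attach each numbered movie line to the most recent critic line."""
    groups = []
    for line in lines:
        if re.match(r"\d+\.\s", line):
            if groups:  # ignore movie lines that appear before any critic
                groups[-1][1].append(line)
        else:
            groups.append((line, []))
    return groups

for critic_info, movie_info in group_by_critic(lines):
    print(critic_info, movie_info)
```

This relies on the assumption that no critic line begins with digits followed by a period, which holds for the sample lines shown here but should be checked against the full page.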



In [360]:
##Write your loop for STEP 3 here

for i in range(0, len(all_info), 2):
    critic_tag = all_info[i].find('p')
    if critic_tag:
        critic_info = critic_tag.text
        if i + 1 < len(all_info):
            movie_info = all_info[i + 1].find_all('p')
            print("Critic Info:", critic_info)
            for movie in movie_info:
                print("Movie Info:", movie.text)
                


Critic Info: Simon Abrams – Freelance film critic (US)
Movie Info: 1. Mulholland Drive (David Lynch, 2001)
Movie Info: 2. In the Mood for Love (Wong Kar-wai, 2000)
Movie Info: 3. The Tree of Life (Terrence Malick, 2011)
Movie Info: 4. Yi Yi: A One and a Two (Edward Yang, 2000)
Movie Info: 5. Goodbye to Language (Jean-Luc Godard, 2014)
Movie Info: 6. The White Meadows (Mohammad Rasoulof, 2009)
Movie Info: 7. Night Across the Street (Raoul Ruiz, 2012)
Movie Info: 8. Certified Copy (Abbas Kiarostami, 2010)
Movie Info: 9. Sparrow (Johnnie To, 2008)
Movie Info: 10. Fados (Carlos Saura, 2007)
Critic Info: Sam Adams – Freelance film critic (US)
Movie Info: 1. In the Mood for Love (Wong Kar-wai, 2000)
Movie Info: 2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)
Movie Info: 3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)
Movie Info: 4. Spirited Away (Hayao Miyazaki, 2001)
Movie Info: 5. The Act of Killing (Joshua Oppenheimer, 2012)
Movie Info: 6. The Grand Budapest Ho

**STEP 4**
If your loop is successfully isolating those two lines, now it's time to parse each line with regular expressions. This needs to happen inside the loop--for every critic, and then (in STEP 5) for every movie. Here just **focus on getting the critic's name, organization, and country.**

Inside the loop--once you have critic_info--make a regular expression that pulls out the name of the critic, and store the result in a variable called critic_name:

`critic_name = re.findall(regex, critic_info)`

Do the same thing for critic_org and critic_cn.

As you go, print(critic_name), then print(critic_org), etc., to make sure you're getting the results. It might help, before you do all these regular expressions in a loop, to just grab one critic's line and test regular expressions on it, to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.

In [252]:
import re

crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
regex_for_name = r"([A-Za-z\s]+) –"
regex_for_org = r"– ([A-Za-z\s]+) \("
regex_for_cn = r"\((.*?)\)"

name = re.findall(regex_for_name, crit_sample)
org = re.findall(regex_for_org, crit_sample)
cn = re.findall(regex_for_cn, crit_sample)

print("Name:", name[0])
print("Organization:", org[0])
print("Country:", cn[0])


Name: Arturo Aguilar
Organization: Rolling Stone Mexico
Country: Mexico
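
The character class `[A-Za-z\s]` in the practice patterns only matches unaccented letters and whitespace, so names or organizations containing accents, hyphens, apostrophes, or periods will not be captured whole. A hedged alternative is to anchor on the separators instead of the characters. The sketch below assumes every critic line has the layout `Name – Organization (Country)` with an en dash; `parse_critic` is an illustrative helper name, and the second test string is made up to exercise the pattern, not taken from the poll.

```python
import re

# Assumed layout: "Name – Organization (Country)", separated by an en dash.
CRITIC_RE = re.compile(r"^(?P<name>.+?)\s+–\s+(?P<org>.+)\s+\((?P<cn>[^()]+)\)\s*$")

def parse_critic(critic_info):
    """Return a dict with name/org/cn, or None when the line does not fit."""
    match = CRITIC_RE.match(critic_info.strip())
    return match.groupdict() if match else None

print(parse_critic("Arturo Aguilar – Rolling Stone Mexico (Mexico)"))
# Made-up line with accents, an apostrophe, and a hyphen in the name.
print(parse_critic("Zoë O'Hara-Núñez – Example Film Weekly (Nowhere)"))
```

Returning `None` for lines that do not fit makes it easy to print and inspect the leftovers rather than silently storing empty results.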


In [199]:
#Take your working loop from step three
#And put it here With the regular expression parsing inside it


[{'critic_name': ['Simon Abrams'],
  'critic_org': ['Freelance film critic'],
  'critic_cn': ['US'],
  'movie_name': [],
  'movie_director': [],
  'movie_year': []},
 {'critic_name': ['Sam Adams'],
  'critic_org': ['Freelance film critic'],
  'critic_cn': ['US'],
  'movie_name': [],
  'movie_director': [],
  'movie_year': []},
 {'critic_name': ['Thelma Adams'],
  'critic_org': ['Freelance film critic'],
  'critic_cn': ['US'],
  'movie_name': [],
  'movie_director': [],
  'movie_year': []},
 {'critic_name': ['Arturo Aguilar'],
  'critic_org': ['Rolling Stone Mexico'],
  'critic_cn': ['Mexico'],
  'movie_name': [],
  'movie_director': [],
  'movie_year': []},
 {'critic_name': ['Matthew Anderson'],
  'critic_org': ['BBC Culture'],
  'critic_cn': ['UK'],
  'movie_name': [],
  'movie_director': [],
  'movie_year': []},
 {'critic_name': ['Tim Appelo'],
  'critic_org': ['The Wrap'],
  'critic_cn': ['US'],
  'movie_name': [],
  'movie_director': [],
  'movie_year': []},
 {'critic_name': [],
  

**STEP 5**
Now you need to get your **movie info**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable -- which is each movie followed by a `<BR>` tag. See our old scraping homeworks for how to get a list of each movie entry--which will contain a string for each movie. Like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop, that goes to each movie and prints out each movie.


In [None]:
##Take your working loop And add the find_all for each_movie
#And the inner loop that loops through each_movie

Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.


In [None]:
#Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
regex_for_mname = r""
#what else should you extract???
movie_name = re.findall(regex_for_mname,movie_sample)
movie_name[0]
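
One way to approach the practice cell, offered as a sketch rather than the expected answer: treat the last pair of parentheses as "director, year" and everything between the rank and that parenthesis as the title. This assumes every movie line has the layout `rank. Title (Director, Year)` with a four-digit year; `parse_movie` and the group names are illustrative choices.

```python
import re

# Assumed layout: "rank. Title (Director, Year)"; the final parenthesis holds director and year.
MOVIE_RE = re.compile(
    r"^(?P<rank>\d+)\.\s+(?P<name>.+)\s+\((?P<director>[^()]+),\s*(?P<year>\d{4})\)\s*$"
)

def parse_movie(movie_line):
    """Return a dict with rank/name/director/year (all strings), or None if no match."""
    match = MOVIE_RE.match(movie_line.strip())
    return match.groupdict() if match else None

print(parse_movie("1. Zero Dark Thirty (Kathryn Bigelow, 2012)"))
print(parse_movie("7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"))
```

The harder sample is a useful test because its title starts with a digit and contains commas. Anchoring the year as the final four digits before the closing parenthesis keeps those commas from being mistaken for the director separator. Note that `rank` and `year` come back as strings and would need `int()` before numeric comparisons.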


**STEP 6**
You're almost there!!! Now that you have working regular expressions, put them in your inner loop to get the movie name.

So now the entire loop should be getting critic information and movie information all separated as separate columns/properties.

Build the loop(s) using print() on the first one or two critics, just to make sure you are pulling out the right data.




In [None]:
#Get that loop working here








**STEP 7**
This is the final step of the hardest part! 

The final step is building a list of dictionaries of all this information.

So you need to have a loop that gets everything out--but you also need to figure out **how you want to organize what you're pulling out.** What should a row look like in your table?
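
As one hedged possibility, each row can be a single pick: the critic's fields merged with one movie's fields. The sketch below assumes the scraping loop already yields a critic line plus a list of movie lines; `scraped` is a hand-typed stand-in for that, and the two patterns assume the `Name – Organization (Country)` and `rank. Title (Director, Year)` layouts. The group names mirror the dictionary keys used earlier in this notebook, with `rank` added as an extra assumption.

```python
import re

# Both patterns assume the line layouts described above.
CRITIC_RE = re.compile(
    r"^(?P<critic_name>.+?)\s+–\s+(?P<critic_org>.+)\s+\((?P<critic_cn>[^()]+)\)\s*$"
)
MOVIE_RE = re.compile(
    r"^(?P<rank>\d+)\.\s+(?P<movie_name>.+)\s+"
    r"\((?P<movie_director>[^()]+),\s*(?P<movie_year>\d{4})\)\s*$"
)

# Hypothetical stand-in for what the scraping loop yields.
scraped = [
    ("Simon Abrams – Freelance film critic (US)",
     ["1. Mulholland Drive (David Lynch, 2001)",
      "2. In the Mood for Love (Wong Kar-wai, 2000)"]),
    ("Sam Adams – Freelance film critic (US)",
     ["1. In the Mood for Love (Wong Kar-wai, 2000)"]),
]

rows = []
for critic_info, movie_lines in scraped:
    critic = CRITIC_RE.match(critic_info)
    if critic is None:
        continue  # skip lines that do not fit the assumed layout
    for movie_line in movie_lines:
        movie = MOVIE_RE.match(movie_line)
        if movie is None:
            continue
        # Merge critic fields and movie fields into one flat row.
        rows.append({**critic.groupdict(), **movie.groupdict()})

print(len(rows))  # 3
print(rows[0])
```

Using `re.match(...).groupdict()` stores plain strings in each row, whereas `re.findall` would store one-element lists that need unwrapping later.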




In [None]:
#figure out how you're going to collect your clean information
list_of_what = []

#loop through the beautiful soup elements
#and use the regexes you developed above to get each unit of info


#Try to figure out how you want to append things
#That is, how you want to organize your data

    

In [None]:
##Take a peek at your final list of dictionaries
list_of_what

If you made it this far, yay!
