## BBC project: process, hints, and recipes

The major challenge of the BBC project is to transform the list of critics and movies into searchable Python lists and/or dictionaries. The most difficult part of this project comes first: scraping the page on the BBC site and, using Beautiful Soup and regular expressions, building a data set that will work.

Once you have the data set, you will be in good shape going forward. The goal after that will be to search for interesting patterns (top movies by country/critic/director/year). This is the conceptual work you need to be thinking about while you struggle through wrangling your data.

So, how do I wrangle this data? That is the central challenge you'll be dealing with through Thursday. The HTML page on the BBC site poses a number of challenges. The layout is relatively simple and consistent, but that simplicity actually makes things a little harder, because there aren't many HTML tags to help you isolate each unit of data. You can use Beautiful Soup to isolate the line that contains all the information for a critic, and you can isolate each group of top 10 movies as well. The harder part is using Beautiful Soup to find each critic together with the list of movies that immediately follows them. (Doing that is challenging. I have instructions on how to figure it out, but if you can't, just DM me on Slack and I will help you!)

Yes, that is how this process will work: below I have step-by-step instructions so you can try to write the code yourself. Do your best, and if you can't get there, Slack me and I will help get your code working so you can move on to the next step.


### Getting started: Data Architecture

The central challenge of this project is figuring out how you are going to set up your table or tables from this long list of critics and movies. What will each row be? What will the columns be in each row? How can you set it up so that you have the most useful table possible?

Some things to think about: the main categories of analysis that are possible include movie, director, critic, critic's country, year, and whatever else you bring to this. Try to design a schema that will give you a table that you can run solid queries on. 

You will eventually want to bring this into pandas, so you want to keep your table as simple and structured as possible. Try to think about how you can transform the main source into one large table that can be aggregated and grouped.
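One way to picture such a table, as a rough sketch rather than the required design, is one row per (critic, movie) pick. The column names below (`critic`, `org`, `critic_country`, `rank`, `title`, `director`, `year`) are my own assumption; the sample values are taken from the first two critics on the page.

```python
from collections import Counter

# One row per (critic, movie) pick. Column names are illustrative assumptions.
rows = [
    {"critic": "Simon Abrams", "org": "Freelance film critic", "critic_country": "US",
     "rank": 1, "title": "Mulholland Drive", "director": "David Lynch", "year": 2001},
    {"critic": "Simon Abrams", "org": "Freelance film critic", "critic_country": "US",
     "rank": 2, "title": "In the Mood for Love", "director": "Wong Kar-wai", "year": 2000},
    {"critic": "Sam Adams", "org": "Freelance film critic", "critic_country": "US",
     "rank": 1, "title": "In the Mood for Love", "director": "Wong Kar-wai", "year": 2000},
]

# With one flat table, counting or grouping by any column stays simple.
votes_per_title = Counter(row["title"] for row in rows)
print(votes_per_title.most_common(1))
```

A list of dictionaries shaped like this can later be handed to `pandas.DataFrame(rows)` without further restructuring.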

### Interpretive Architecture
**REMEMBER: secondary source** Part of this week's work is to find a source you can use to get the country of origin for each director. This is something you need to search for on your own. It will be hard to find a single page that lists every single director, but see what you can find. In the end, you don't have to have a complete database of every director, but do your best to get as many as you can.

You don't necessarily have to go in the direction of directors' origin. You can certainly try to think of other categories of interpretation that you can join to this initial dataset. This is how you bring your point of view to a relatively large data set that seeks to frame the past 15 years of cinema. How can you bring a different point of view to this subject? You can certainly narrow your focus to a specific country, a group of countries, or a region. Either way, think about other data that might bring different types of insight to this list.
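As a minimal sketch of what "joining" a secondary source could look like, assume you have collected a lookup from director to country. The `director_country` dictionary and the `director_country` column are hypothetical names, and the lookup is deliberately incomplete to show how missing directors can be handled.

```python
# Hypothetical secondary source: director -> country of origin.
director_country = {
    "David Lynch": "US",
    "Hayao Miyazaki": "Japan",
}

picks = [
    {"title": "Mulholland Drive", "director": "David Lynch"},
    {"title": "Spirited Away", "director": "Hayao Miyazaki"},
    {"title": "Sparrow", "director": "Johnnie To"},  # not in the lookup yet
]

# Attach the country where we have one; mark the rest as unknown.
for pick in picks:
    pick["director_country"] = director_country.get(pick["director"], "unknown")

# Share of picks that the secondary source could label.
coverage = sum(p["director_country"] != "unknown" for p in picks) / len(picks)
print(round(coverage, 2))
```

Tracking coverage this way tells you how complete your secondary source is before you draw conclusions from it.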

### Ready to code?

The first thing you need to do is import Beautiful Soup and requests, like we did in the homework, and scrape the page: http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted


One thing I should note: there are two inconsistencies (actual errors in the HTML) that will cause you to lose a couple of entries (which is okay but may be frustrating). I have posted a version of the exact same page with those inconsistencies fixed, if you want to scrape from that page:

http://floatingmedia.com/columbia/BBC.html

It's up to you. Okay let's begin!

**STEP 1**


In [192]:
## Import your libraries: Beautiful Soup, requests, and re (for regular expressions)
from bs4 import BeautifulSoup
import requests
import re


In [193]:
# read the URL, and put the HTML page into beautiful soup
my_url = "http://floatingmedia.com/columbia/BBC.html"
raw_html = requests.get(my_url).content


In [194]:
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(soup_doc.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <title>
   BBC - Culture - The 21st Century’s 100 greatest films: Who voted?
  </title>
  <meta content="story, STORY, story, image, the-100-greatest-films-of-the-21st-century, " name="keywords"/>
  <meta content="We polled 177 critics from around the world – here is how they voted." name="description"/>
  <meta content="The 21st Century’s 100 greatest films: Who voted?" property="og:title">
   <meta content="article" property="og:type">
    <meta content="http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted" property="og:url">
     <meta content="We polled 177 critics from around the world – here is how they voted." property="og:description">
      <meta content="summary_large_image" name="twitter:card"/>
      <meta content="@BBC_Culture" name="twitter:site"/>
      <meta content="The 21st Century’s 100 greatest fil

In [195]:
#Using beautiful soup find the div tag that contains 
#the entire list of critics and movies
#Make a variable (like all_info) that holds all that information 


all_info = soup_doc.find_all("div")
print(all_info)

[<div class="bbccom_display_none" id="bbccom_interstitial_ad"></div>, <div class="bbccom_display_none" id="bbccom_interstitial"><script type="text/javascript"> /*<![CDATA[*/ (function() { if (window.bbcdotcom && bbcdotcom.config.isActive('ads')) { googletag.cmd.push(function() { googletag.display('bbccom_interstitial'); }); } }()); /*]]>*/ </script></div>, <div class="bbccom_display_none" id="bbccom_wallpaper_ad"></div>, <div class="bbccom_display_none" id="bbccom_wallpaper"><script type="text/javascript"> /*<![CDATA[*/ (function() { var wallpaper; if (window.bbcdotcom && bbcdotcom.config.isActive('ads')) { if (bbcdotcom.config.isAsync()) { googletag.cmd.push(function() { googletag.display('bbccom_wallpaper'); }); } else { googletag.display("wallpaper"); } wallpaper = bbcdotcom.adverts.adRegister.getAd('wallpaper'); } }()); /*]]>*/ </script></div>, <div id="blq-global"> <div id="blq-pre-mast"> </div> </div>, <div id="blq-pre-mast"> </div>, <div class="orb-nav-pri orb-nav-pri-white b-he

**STEP 2** Here is where it begins to get tricky: at this point everything we want is wrapped in `<p>` tags. Use a Beautiful Soup find_all to get a list of everything in a `<p>` tag. Make a variable that contains that list (you could call it all_p or something).


In [196]:
# find_all returns a list of every <p> tag in the document
all_p = soup_doc.find_all("p")
print(all_p)


[<p style="position: absolute; top: -999em"><img alt="" height="1" src="//sa.bbc.co.uk/bbc/bbc/s?name=culture.story.20160819-the-21st-centurys-100-greatest-films-who-voted.page&amp;ml_name=webmodule&amp;ml_version=65&amp;blq_js_enabled=0&amp;blq_s=4d&amp;blq_r=2.7&amp;blq_v=default&amp;blq_e=pal&amp;pal_route=webserviceapi&amp;app_type=responsive&amp;language=en-GB&amp;pal_webapp=barlesque&amp;prod_name=frameworks&amp;app_name=frameworks" width="1"/></p>, <p class="introduction">We polled 177 critics from around the world – here is how they voted.</p>, <p>Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE

**STEP 3** This is where all the magic has to happen: you need to find a way to loop through all of the `<p>` elements (the list you just got from find_all()) and pull out each critic and their list of movies.

Critics should not be too hard--every critic entry is embedded in `<strong>` tags. But in order to get the movies attached to that critic--you need to find the `<p>` tag immediately following each `<p><strong>` -- you can do this using next_sibling.

So, you need to build a loop that searches through your `all_p` list. For each `p_line`, if it has a `<strong>` tag, then:

- `critic_info = p_line.strong.string`
- `movie_info = p_line.next_sibling`

As you go through this loop, print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by the movie line's HTML, you've got it!
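Here is a small sketch of the sibling lookup on a made-up fragment (the HTML below is invented to resemble the page, not copied from it). It also shows one thing to watch for: `next_sibling` returns whatever node comes next, so if there is whitespace between two `<p>` tags you get that text instead of the tag. `find_next_sibling("p")` is a Beautiful Soup alternative that skips ahead to the next `<p>`.

```python
from bs4 import BeautifulSoup

# Invented fragment shaped like the BBC page; note the newline between the <p> tags.
sample_html = (
    "<div>"
    "<p><strong>Critic One – Some Outlet (US)</strong></p>\n"
    "<p>1. Film A (Director A, 2001)<br/>2. Film B (Director B, 2002)</p>"
    "<p>Unrelated paragraph</p>"
    "</div>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# In this fragment the very next node is the newline, not the movie <p>.
raw_next = soup.find("p").next_sibling

pairs = []
for p_line in soup.find_all("p"):
    if p_line.strong is not None:
        critic_info = p_line.strong.string
        # Skip any text nodes and go straight to the following <p> tag.
        movie_info = p_line.find_next_sibling("p")
        pairs.append((critic_info, movie_info))

print(repr(raw_next))
print(pairs[0][0])
print(pairs[0][1])
```

If plain `next_sibling` already hands you the movie `<p>` on the page you are scraping, you can keep using it; the alternative only matters when stray whitespace sits between the tags.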

I give you the beginning of the loop below, and then you can build it piece by piece. If you want to see the overall architecture of the final loop, I have a commented example at the end of the page, though it might not be helpful to look at it at this point. See how you do step by step, and if you get stuck at a step, Slack me with your code!



In [197]:
## Write your loop for STEP 3 here
# I started this for you.
# Because you only want it to search starting with each critic,
#   `if lines.strong is not None:` does that for you
all_p = soup_doc.find_all("p")
all_movies = []
for lines in all_p:
    if lines.strong is not None:
        critic_info = lines.strong.string
        movie_info = lines.next_sibling
        each_critic = [critic_info, movie_info]
        all_movies.append(each_critic)
all_movies


[['Simon Abrams – Freelance film critic (US)',
  <p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Rasoulof, 2009)<br/>7. Night Across the Street (Raoul Ruiz, 2012)<br/>8. Certified Copy (Abbas Kiarostami, 2010)<br/>9. Sparrow (Johnnie To, 2008)<br/>10. Fados (Carlos Saura, 2007)</p>],
 ['Sam Adams – Freelance film critic (US)',
  <p>1. In the Mood for Love (Wong Kar-wai, 2000)<br/>2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)<br/>3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)<br/>4. Spirited Away (Hayao Miyazaki, 2001)<br/>5. The Act of Killing (Joshua Oppenheimer, 2012)<br/>6. The Grand Budapest Hotel (Wes Anderson, 2014)<br/>7. The New World (Terrence Malick, 2004)<br/>8. Certified Copy (Abbas Kiarostami, 2010)<br/>9. The World (J

**STEP 4**
If your loop is successfully isolating those two lines, now it's time to parse each line with regular expressions. This needs to happen inside the loop: for every critic, and then (in STEP 5) for every movie. Here just **focus on getting the critic's name, organization, and country.**

Inside the loop, once you have critic_info, make a regular expression that pulls out the name of the critic, and make a variable called critic_name:

`critic_name = re.findall(regex, critic_info)`

Do the same thing for critic_org and critic_cn.

As you go, print(critic_name), then print(critic_org), etc., to make sure you're getting the results. It might help, before you do all these regular expressions in a loop, to just grab one critic's line and test regular expressions on it, to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.

In [198]:
import re
#Practice/Build your regular expressions here
crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
regex_for_name = r"^([\w\s-]+)–" 
regex_for_org = r"–\s([A-Z][\w\s]+)[(]" 
regex_for_cn = r"\s[(][\w\s]+[)]$"   

#r"^([\w\s'-]+)––\s([A-Z][\w\s]+)[(]\s[(][\w\s]+[)]$"
name = re.findall(regex_for_name,crit_sample)
name[0]

#org = re.findall(regex_for_org,crit_sample)
#org[0]

#cn = re.findall(regex_for_cn,crit_sample)
#cn[0]

'Arturo Aguilar '
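Notice the trailing space in `'Arturo Aguilar '`. One possible way to tidy that up, offered as a sketch and not the only approach, is a single pattern with named groups plus `.strip()`. The names `critic_pattern` and `parse_critic` are my own, and the pattern assumes every line looks like `Name – Organization (Country)` with an en dash as the separator.

```python
import re

# Assumed line shape: "Name – Organization (Country)", separated by an en dash.
critic_pattern = re.compile(
    r"^(?P<name>.+?)\s+–\s+(?P<org>.+)\s+\((?P<country>[^()]+)\)\s*$"
)

def parse_critic(line):
    """Return a dict with name, org, and country, or None if the line does not fit."""
    match = critic_pattern.match(line)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}

parsed = parse_critic("Arturo Aguilar – Rolling Stone Mexico (Mexico)")
print(parsed)
print(parse_critic("a line with no separator"))
```

Returning `None` for lines that do not fit makes it easy to print and inspect the odd ones instead of crashing on them.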

In [199]:
#Take your working loop from step three
#And put it here With the regular expression parsing inside it

import re

all_p = soup_doc.find_all("p")
for lines in all_p:
    if lines.strong is not None:
        critic_info = lines.strong.string
        regex = r"^([\w\s'-]+)"
        org_regex = r"–\s([A-Z].+)[(]"
        critic_cn_regex = r"\s[(][\w\s]+[)]$"
        critic_name = re.findall(regex,critic_info)
       
        critic_org = re.findall(org_regex,critic_info)
        print(critic_org[0])
        critic_cn = re.findall(critic_cn_regex,critic_info)
        print(critic_cn[0])



Freelance film critic 
 (US)
Freelance film critic 
 (US)
Freelance film critic 
 (US)
Rolling Stone Mexico 
 (Mexico)
BBC Culture 
 (UK)
The Wrap 
 (US)
Film historian 
 (Italy)
Nerdist 
 (US)
Dipnot TV 
 (Turkey)
The Village Voice 
 (US)
Freelance film critic 
 (Brazil)
Toronto Film Festival 
 (Canada)
BBC Culture 
 (UK)
Freelance film critic 
 (US)
BBC Culture 
 (UK)
La Nacion 
 (Argentina)
Positif 
 (France)
University of Glasgow 
 (UK)
BBC Culture 
 (US)
African Film Festival Inc 
 (US)
Spiegel Online 
 (Germany)
Freelance film critic 
 (India)
The New Yorker 
 (US)
Jerusalem Post 
 (Israel)
The Guardian/BBC Culture 
 (Australia)
Cinemateca de Cuba 
 (Cuba)
New York Times Watching 
 (US)
El Colombiano 
 (Colombia)
Los Angeles Times 
 (US)
AfricaFilms.TV 
 (Senegal)
Freelance film critic 
 (South Korea)
The Telegraph 
 (UK)
IGN 
 (US)
Minneapolis Star-Tribune 
 (US)
Rappler.com 
 (Philippines)
New York University 
 (US)
Fandango 
 (US)
Variety 
 (US)
La Libre Belgique 
 (Belgium)
U

IndexError: list index out of range
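That `IndexError` is what you get when `re.findall` finds nothing: it returns an empty list, and asking for `[0]` on an empty list fails. I can't tell from the truncated output which critic line triggered it, but any line that doesn't fit the pattern (for example, an organization that doesn't start with a capital letter) would do it. A small guard, sketched here with a hypothetical helper called `first_match`, lets the loop keep going so you can see which lines need attention. The second test line below is invented for illustration.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first findall result, or a default when nothing matches."""
    found = re.findall(pattern, text)
    return found[0] if found else default

org_regex = r"–\s([A-Z].+)[(]"

ok_line = "Sam Adams – Freelance film critic (US)"
odd_line = "Some Critic – indie outlet (US)"  # invented: org starts lowercase

print(first_match(org_regex, ok_line))
print(first_match(org_regex, odd_line))
```

Inside the loop you could print `critic_info` whenever the helper returns `None`, then adjust the regular expression to cover those cases.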

**STEP 5**
Now you need to get your **movie names**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable -- which is each movie followed by a `<BR>` tag. I showed you this in class, but I'll just tell you again how to do this. To get a list of everything that is not a `<BR>` tag, use this method:

`each_movie = movie_info.find_all(string=True)`

This will give you a list called `each_movie`, which will contain a string for each movie, like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop that goes through each movie and prints it out.
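To see what `find_all(string=True)` returns before wiring it into the main loop, here is a quick sketch on an invented two-movie paragraph (the film names are placeholders, not from the page). The `.strip()` filter is an extra precaution of mine in case a list contains whitespace-only strings.

```python
from bs4 import BeautifulSoup

# Invented paragraph in the same shape as a critic's movie list.
movie_html = "<p>1. Film A (Director A, 2001)<br/>2. Film B (Director B, 2002)</p>"
movie_info = BeautifulSoup(movie_html, "html.parser").p

# Every text node inside the <p>; the <br/> tags themselves are left out.
each_movie = movie_info.find_all(string=True)

# Convert to plain strings and drop anything that is only whitespace.
movie_lines = [str(movie).strip() for movie in each_movie if movie.strip()]

for movie in movie_lines:
    print(movie)
```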


In [243]:
## Take your working loop and add the find_all for each_movie,
# and the inner loop that loops through each_movie
all_p = soup_doc.find_all("p")
for lines in all_p:
    if lines.strong is not None:
        critic_info = lines.strong.string
        movie_info = lines.next_sibling
        each_movie = movie_info.find_all(string=True)

        for movie in each_movie:
            print(movie)




1. Mulholland Drive (David Lynch, 2001)
2. In the Mood for Love (Wong Kar-wai, 2000)
3. The Tree of Life (Terrence Malick, 2011)
4. Yi Yi: A One and a Two (Edward Yang, 2000)
5. Goodbye to Language (Jean-Luc Godard, 2014)
6. The White Meadows (Mohammad Rasoulof, 2009)
7. Night Across the Street (Raoul Ruiz, 2012)
8. Certified Copy (Abbas Kiarostami, 2010)
9. Sparrow (Johnnie To, 2008)
10. Fados (Carlos Saura, 2007)
1. In the Mood for Love (Wong Kar-wai, 2000)
2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)
3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)
4. Spirited Away (Hayao Miyazaki, 2001)
5. The Act of Killing (Joshua Oppenheimer, 2012)
6. The Grand Budapest Hotel (Wes Anderson, 2014)
7. The New World (Terrence Malick, 2004)
8. Certified Copy (Abbas Kiarostami, 2010)
9. The World (Jia Zhangke, 2004)
10. Elephant (Gus Van Sant, 2003)
1. Zero Dark Thirty (Kathryn Bigelow, 2012)
2. A History of Violence (David Cronenberg, 2005)
3. The Grand Budapest Hotel (

Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.


In [201]:
#Practice/Build your regular expressions here
import re

movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
regex_for_mname = r"^\d\d?[.]\s(.+)[(]([A-Z]\D+),\s(\d{4})"
movie_name = re.findall(regex_for_mname,movie_sample)
movie_name


[('Zero Dark Thirty ', 'Kathryn Bigelow', '2012')]
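The result above still carries a trailing space on the title and leaves the rank out. As one possible variation, assuming every line has the shape `rank. Title (Director, year)`, a pattern with named groups can return clean fields and convert the numbers. `movie_pattern` and `parse_movie` are names I made up for this sketch.

```python
import re

# Assumed line shape: "rank. Title (Director, year)".
movie_pattern = re.compile(
    r"^(?P<rank>\d{1,2})\.\s+(?P<title>.+)\s+\((?P<director>[^()]+),\s*(?P<year>\d{4})\)\s*$"
)

def parse_movie(line):
    """Return rank, title, director, and year, or None if the line does not fit."""
    match = movie_pattern.match(line.strip())
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "title": match.group("title").strip(),
        "director": match.group("director").strip(),
        "year": int(match.group("year")),
    }

movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"

print(parse_movie(movie_sample))
print(parse_movie(movie_harder))
```

Testing against `movie_harder` is worthwhile because its title contains digits and a comma, which are exactly the characters a looser pattern tends to trip over.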

**STEP 6**
You're almost there!!! Now that you have a working regular expression, put it in your inner loop to get the movie name.

So now the entire loop should be getting you 13 elements per critic. First, the three critic fields:
- critic_name
- critic_org
- critic_cn

And an inner loop that will run 10 times (for the 10 movies) and give you 10 instances of:
- rank (this is actually optional, but maybe helpful to keep)
- movie_name
- director
- year

Build this loop using print() on the first one or two critic selections, just to make sure you are pulling out the right data.




In [202]:
for lines in all_p:
    if lines.strong is not None:
        critic_info = lines.strong.string
        movie_info = lines.next_sibling
        each_movie = movie_info.find_all(string=True)

        for movie in each_movie:
            each_movie_regex = r"^\d\d?[.]\s(.+)[(](\D+),\s(\d{4})"
            this_movie = re.findall(each_movie_regex,movie)
            print(this_movie)

[('Mulholland Drive ', 'David Lynch', '2001')]
[('In the Mood for Love ', 'Wong Kar-wai', '2000')]
[('The Tree of Life ', 'Terrence Malick', '2011')]
[('Yi Yi: A One and a Two ', 'Edward Yang', '2000')]
[('Goodbye to Language ', 'Jean-Luc Godard', '2014')]
[('The White Meadows ', 'Mohammad Rasoulof', '2009')]
[('Night Across the Street ', 'Raoul Ruiz', '2012')]
[('Certified Copy ', 'Abbas Kiarostami', '2010')]
[('Sparrow ', 'Johnnie To', '2008')]
[('Fados ', 'Carlos Saura', '2007')]
[('In the Mood for Love ', 'Wong Kar-wai', '2000')]
[('Eternal Sunshine of the Spotless Mind ', 'Michel Gondry', '2004')]
[('Syndromes and a Century ', 'Apichatpong Weerasethakul', '2006')]
[('Spirited Away ', 'Hayao Miyazaki', '2001')]
[('The Act of Killing ', 'Joshua Oppenheimer', '2012')]
[('The Grand Budapest Hotel ', 'Wes Anderson', '2014')]
[('The New World ', 'Terrence Malick', '2004')]
[('Certified Copy ', 'Abbas Kiarostami', '2010')]
[('The World ', 'Jia Zhangke', '2004')]
[('Elephant ', 'Gus Van S

**STEP 7**
This is the final step of the hardest part! If you make it all the way to the end of this let me know and we can discuss what to do next. If you've made it just following instructions, you are in great shape for the rest of this project--if not, don't worry! I will get you through by midweek.

The final step is building a list of lists of all this information.

So you need to have a loop that gets everything out, but you also need to figure out **how you want to organize what you're pulling out.** What should a row look like in your table?
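One option, sketched here under my own assumptions about column order, is a flat list of lists where every row repeats the critic fields next to one movie pick. `build_rows` is a hypothetical helper, and the inputs are written out by hand to stand in for what your regular expressions would produce.

```python
def build_rows(critic, movies):
    """Flatten one critic and their movie picks into one row per pick."""
    rows = []
    for rank, title, director, year in movies:
        rows.append([critic["name"], critic["org"], critic["country"],
                     rank, title, director, year])
    return rows

# Hand-written stand-ins for the parsed critic line and movie lines.
critic = {"name": "Simon Abrams", "org": "Freelance film critic", "country": "US"}
movies = [
    (1, "Mulholland Drive", "David Lynch", 2001),
    (2, "In the Mood for Love", "Wong Kar-wai", 2000),
]

all_rows = []
# extend keeps the table flat: one row per pick, not one nested list per critic.
all_rows.extend(build_rows(critic, movies))

for row in all_rows:
    print(row)
```

Repeating the critic fields on every row looks redundant, but it is what makes the table easy to group by critic, country, director, or year later on.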


In the cell below, I give you a final architecture you need to use to get this most challenging list of lists.

In [203]:
# The full pattern, followed by its individual pieces
regex_for_mname = r"^\d\d?[.]\s(.+)[(]([A-Z]\D+),\s(\d{4})"
Movie_rank = r"^\d\d?[.]"
Movie_name = r"\s(.+)[(]"
Movie_director = r"([A-Z]\D+)"
movie_year = r"\s(\d{4})"

In [205]:
# Figure out how you're going to collect your clean information
BBC_critic_movies_list = []

for lines in all_p[:-1]:
    if lines.strong is not None:
        critic_info = lines.strong.string
        regex = r"^([\w\s'-]+)"
        org_regex = r"–\s([A-Z].+)[(]"
        critic_cn_regex = r"\s[(][\w\s]+[)]$"

        critic_name = re.findall(regex, critic_info)
        print(critic_name[0])
        critic_org = re.findall(org_regex, critic_info)
        print(critic_org[0])
        critic_cn = re.findall(critic_cn_regex, critic_info)
        print(critic_cn[0])

        movie_info = lines.next_sibling
        each_movie = movie_info.find_all(string=True)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
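        for movie in each_movie:
            # Capture title, director, and year from a line like "1. Title (Director, 2001)"
            each_movie_regex = r"^\d\d?[.]\s(.+)[(](\D+),\s(\d{4})"
            # findall returns a list of tuples; [0] raises IndexError if nothing matched
            this_movie = re.findall(each_movie_regex, movie)[0]
            print(this_movie)
            # One row per critic/movie pair
            BBC_list = [critic_name[0], critic_org[0], critic_cn[0], this_movie[0], this_movie[1], this_movie[2]]
            BBC_critic_movies_list.append(BBC_list)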
            



Simon Abrams 
Freelance film critic 
 (US)
('Mulholland Drive ', 'David Lynch', '2001')
('In the Mood for Love ', 'Wong Kar-wai', '2000')
('The Tree of Life ', 'Terrence Malick', '2011')
('Yi Yi: A One and a Two ', 'Edward Yang', '2000')
('Goodbye to Language ', 'Jean-Luc Godard', '2014')
('The White Meadows ', 'Mohammad Rasoulof', '2009')
('Night Across the Street ', 'Raoul Ruiz', '2012')
('Certified Copy ', 'Abbas Kiarostami', '2010')
('Sparrow ', 'Johnnie To', '2008')
('Fados ', 'Carlos Saura', '2007')
Sam Adams 
Freelance film critic 
 (US)
('In the Mood for Love ', 'Wong Kar-wai', '2000')
('Eternal Sunshine of the Spotless Mind ', 'Michel Gondry', '2004')
('Syndromes and a Century ', 'Apichatpong Weerasethakul', '2006')
('Spirited Away ', 'Hayao Miyazaki', '2001')
('The Act of Killing ', 'Joshua Oppenheimer', '2012')
('The Grand Budapest Hotel ', 'Wes Anderson', '2014')
('The New World ', 'Terrence Malick', '2004')
('Certified Copy ', 'Abbas Kiarostami', '2010')
('The World ', 'Ji

IndexError: list index out of range
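
The `IndexError` above is what `re.findall(...)[0]` raises when a line does not match the pattern, because `findall` then returns an empty list and there is no item `[0]` to take. Here is a minimal sketch of one way to guard against that, assuming each entry is a string shaped like `1. Title (Director, Year)`. The helper name `parse_movie` and the two sample lines are made up for illustration; they are not taken from the BBC page.

```python
import re

# Same pattern as in the loop above: rank, title, director, four-digit year
EACH_MOVIE_REGEX = r"^\d\d?[.]\s(.+)[(](\D+),\s(\d{4})"

def parse_movie(line):
    """Return (title, director, year), or None when the line does not match."""
    match = re.search(EACH_MOVIE_REGEX, line)
    if match is None:
        return None
    return match.groups()

# Hypothetical sample lines; the second has no "(Director, Year)" part
sample_lines = [
    "1. Mulholland Drive (David Lynch, 2001)",
    "2. A line with no director or year",
]

parsed = []
skipped = []
for line in sample_lines:
    result = parse_movie(line)
    if result is None:
        skipped.append(line)   # keep these so you can inspect them by hand
    else:
        parsed.append(result)

print(parsed)
print(skipped)
```

Collecting the skipped lines instead of silently dropping them lets you see exactly which entries on the page break the pattern, so you can decide whether to adjust the regex or fix those rows manually.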

In [206]:
BBC_critic_movies_list

[['Simon Abrams ',
  'Freelance film critic ',
  ' (US)',
  'Mulholland Drive ',
  'David Lynch',
  '2001'],
 ['Simon Abrams ',
  'Freelance film critic ',
  ' (US)',
  'In the Mood for Love ',
  'Wong Kar-wai',
  '2000'],
 ['Simon Abrams ',
  'Freelance film critic ',
  ' (US)',
  'The Tree of Life ',
  'Terrence Malick',
  '2011'],
 ['Simon Abrams ',
  'Freelance film critic ',
  ' (US)',
  'Yi Yi: A One and a Two ',
  'Edward Yang',
  '2000'],
 ['Simon Abrams ',
  'Freelance film critic ',
  ' (US)',
  'Goodbye to Language ',
  'Jean-Luc Godard',
  '2014'],
 ['Simon Abrams ',
  'Freelance film critic ',
  ' (US)',
  'The White Meadows ',
  'Mohammad Rasoulof',
  '2009'],
 ['Simon Abrams ',
  'Freelance film critic ',
  ' (US)',
  'Night Across the Street ',
  'Raoul Ruiz',
  '2012'],
 ['Simon Abrams ',
  'Freelance film critic ',
  ' (US)',
  'Certified Copy ',
  'Abbas Kiarostami',
  '2010'],
 ['Simon Abrams ',
  'Freelance film critic ',
  ' (US)',
  'Sparrow ',
  'Johnnie To',
  

In [207]:
import numpy as np
import pandas as pd



In [208]:
col_names = ['critic_name', 'critic_org', 'critic_country', 'movie_name','movie_dir','movie_year']
df = pd.DataFrame.from_records(BBC_critic_movies_list, columns=col_names)
df.loc[53]

critic_name              Tim Appelo 
critic_org                 The Wrap 
critic_country                  (US)
movie_name          Pan's Labyrinth 
movie_dir         Guillermo Del Toro
movie_year                      2006
Name: 53, dtype: object

In [209]:
df.to_csv(r'C:/Users/kanin/DATABASE/movie_country.csv')

In [210]:
df.head()


Unnamed: 0,critic_name,critic_org,critic_country,movie_name,movie_dir,movie_year
0,Simon Abrams,Freelance film critic,(US),Mulholland Drive,David Lynch,2001
1,Simon Abrams,Freelance film critic,(US),In the Mood for Love,Wong Kar-wai,2000
2,Simon Abrams,Freelance film critic,(US),The Tree of Life,Terrence Malick,2011
3,Simon Abrams,Freelance film critic,(US),Yi Yi: A One and a Two,Edward Yang,2000
4,Simon Abrams,Freelance film critic,(US),Goodbye to Language,Jean-Luc Godard,2014


If you made it this far, congratulations!

You can go ahead and try to build the list of movies and/or the list of directors on your own--they will use similar logic, but they will not be nearly as complicated as this one.

In [211]:
df.groupby('critic_country')['critic_name'].nunique()

critic_country
 (Argentina)        2
 (Australia)        4
 (Austria)          2
 (Bangladesh)       1
 (Belgium)          1
 (Brazil)           1
 (Canada)           5
 (Chile)            2
 (China)            1
 (Colombia)         4
 (Cuba)             5
 (Egypt)            1
 (France)           5
 (Germany)          5
 (Hong Kong)        1
 (India)            5
 (Indonesia)        1
 (Israel)           4
 (Italy)            4
 (Japan)            1
 (Kazakhstan)       1
 (Lebanon)          3
 (Mexico)           2
 (Namibia)          1
 (Philippines)      1
 (Qatar)            1
 (Senegal)          1
 (Singapore)        2
 (South Africa)     1
 (South Korea)      2
 (Switzerland)      1
 (Taiwan)           1
 (Turkey)           2
 (UAE)              3
 (UK)              18
 (US)              82
Name: critic_name, dtype: int64

In [212]:
#Narrowing down my data into a smaller frame
df1 = df.groupby('critic_country')['movie_name'].value_counts().reset_index(name='count')
df1.head()

Unnamed: 0,critic_country,movie_name,count
0,(Argentina),Spirited Away,2
1,(Argentina),Adventureland,1
2,(Argentina),Boyhood,1
3,(Argentina),Elephant,1
4,(Argentina),Extraordinary Stories,1


In [213]:

df1["string"] = "<b>Movie:</b> " + df1["movie_name"] + ": " + df1["count"].map(str) + np.where(df1["count"]>1, ' votes', ' vote')
df1.head()



Unnamed: 0,critic_country,movie_name,count,string
0,(Argentina),Spirited Away,2,<b>Movie:</b> Spirited Away : 2 votes
1,(Argentina),Adventureland,1,<b>Movie:</b> Adventureland : 1 vote
2,(Argentina),Boyhood,1,<b>Movie:</b> Boyhood : 1 vote
3,(Argentina),Elephant,1,<b>Movie:</b> Elephant : 1 vote
4,(Argentina),Extraordinary Stories,1,<b>Movie:</b> Extraordinary Stories : 1 vote


In [214]:
#This is nice, but I need only one row per country.
#So I group again and combine each country's strings into one cell,
#wrapping them in some HTML as I go.


output = df1.groupby('critic_country')['string'].apply(lambda x: "<div id='movie'><P>%s</P></div>" % '</p><p> '.join(x)).reset_index(name='properties.article')
output



Unnamed: 0,critic_country,properties.article
0,(Argentina),<div id='movie'><P><b>Movie:</b> Spirited Away...
1,(Australia),<div id='movie'><P><b>Movie:</b> A Separation ...
2,(Austria),<div id='movie'><P><b>Movie:</b> Platform : 2 ...
3,(Bangladesh),<div id='movie'><P><b>Movie:</b> A Separation ...
4,(Belgium),<div id='movie'><P><b>Movie:</b> Amores Perros...
5,(Brazil),<div id='movie'><P><b>Movie:</b> Birdman : 1 v...
6,(Canada),<div id='movie'><P><b>Movie:</b> In the Mood f...
7,(Chile),<div id='movie'><P><b>Movie:</b> A History of ...
8,(China),<div id='movie'><P><b>Movie:</b> A Separation ...
9,(Colombia),<div id='movie'><P><b>Movie:</b> A Separation ...


In [215]:
output.iloc[3]['properties.article']

"<div id='movie'><P><b>Movie:</b> A Separation : 1 vote</p><p> <b>Movie:</b> Babel : 1 vote</p><p> <b>Movie:</b> Brokeback Mountain : 1 vote</p><p> <b>Movie:</b> In the Mood for Love : 1 vote</p><p> <b>Movie:</b> Oldboy : 1 vote</p><p> <b>Movie:</b> Once Upon a Time in Anatolia : 1 vote</p><p> <b>Movie:</b> Russian Ark : 1 vote</p><p> <b>Movie:</b> Spirited Away : 1 vote</p><p> <b>Movie:</b> Spring, Summer, Fall, Winter…and Spring : 1 vote</p><p> <b>Movie:</b> Ten : 1 vote</P></div>"

In [216]:


crits = df.groupby('critic_country')['critic_name'].nunique().reset_index(name='properties.headline')
crits


Unnamed: 0,critic_country,properties.headline
0,(Argentina),2
1,(Australia),4
2,(Austria),2
3,(Bangladesh),1
4,(Belgium),1
5,(Brazil),1
6,(Canada),5
7,(Chile),2
8,(China),1
9,(Colombia),4


In [217]:
output['critic_country'] = output['critic_country'].str.replace('(', '', regex=False)  # treat '(' as a literal character

In [218]:
output['critic_country'] = output['critic_country'].str.replace(')', '', regex=False)  # treat ')' as a literal character

In [219]:
output['critic_country'] = output['critic_country'].str.strip()

In [220]:
output

Unnamed: 0,critic_country,properties.article
0,Argentina,<div id='movie'><P><b>Movie:</b> Spirited Away...
1,Australia,<div id='movie'><P><b>Movie:</b> A Separation ...
2,Austria,<div id='movie'><P><b>Movie:</b> Platform : 2 ...
3,Bangladesh,<div id='movie'><P><b>Movie:</b> A Separation ...
4,Belgium,<div id='movie'><P><b>Movie:</b> Amores Perros...
5,Brazil,<div id='movie'><P><b>Movie:</b> Birdman : 1 v...
6,Canada,<div id='movie'><P><b>Movie:</b> In the Mood f...
7,Chile,<div id='movie'><P><b>Movie:</b> A History of ...
8,China,<div id='movie'><P><b>Movie:</b> A Separation ...
9,Colombia,<div id='movie'><P><b>Movie:</b> A Separation ...


In [221]:
output = output.merge(crits, how='left', on='critic_country')

In [222]:
output

Unnamed: 0,critic_country,properties.article,properties.headline
0,Argentina,<div id='movie'><P><b>Movie:</b> Spirited Away...,
1,Australia,<div id='movie'><P><b>Movie:</b> A Separation ...,
2,Austria,<div id='movie'><P><b>Movie:</b> Platform : 2 ...,
3,Bangladesh,<div id='movie'><P><b>Movie:</b> A Separation ...,
4,Belgium,<div id='movie'><P><b>Movie:</b> Amores Perros...,
5,Brazil,<div id='movie'><P><b>Movie:</b> Birdman : 1 v...,
6,Canada,<div id='movie'><P><b>Movie:</b> In the Mood f...,
7,Chile,<div id='movie'><P><b>Movie:</b> A History of ...,
8,China,<div id='movie'><P><b>Movie:</b> A Separation ...,
9,Colombia,<div id='movie'><P><b>Movie:</b> A Separation ...,


In [223]:

output['properties.headline'] = output['properties.headline'].map(str) + " critic_name"

In [224]:
output

Unnamed: 0,critic_country,properties.article,properties.headline
0,Argentina,<div id='movie'><P><b>Movie:</b> Spirited Away...,nan critic_name
1,Australia,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name
2,Austria,<div id='movie'><P><b>Movie:</b> Platform : 2 ...,nan critic_name
3,Bangladesh,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name
4,Belgium,<div id='movie'><P><b>Movie:</b> Amores Perros...,nan critic_name
5,Brazil,<div id='movie'><P><b>Movie:</b> Birdman : 1 v...,nan critic_name
6,Canada,<div id='movie'><P><b>Movie:</b> In the Mood f...,nan critic_name
7,Chile,<div id='movie'><P><b>Movie:</b> A History of ...,nan critic_name
8,China,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name
9,Colombia,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name
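
The `nan` values in `properties.headline` are consistent with the merge keys not lining up: `output['critic_country']` was cleaned to `Argentina`, while `crits['critic_country']` still holds the raw ` (Argentina)`, so the left merge finds no partner rows. Here is a minimal sketch of one possible fix, cleaning the key on both frames before merging. The two small frames are stand-ins built by hand, the helper `clean_country` is hypothetical, and the ` critics` suffix is only a guess at the intended label.

```python
import pandas as pd

def clean_country(series):
    """Drop the parentheses and surrounding whitespace from a country column."""
    return (series.str.replace('(', '', regex=False)
                  .str.replace(')', '', regex=False)
                  .str.strip())

# Stand-in frames that mimic the shapes of `output` and `crits` above
output_demo = pd.DataFrame({
    'critic_country': ['Argentina', 'Australia'],        # already cleaned
    'properties.article': ['<div>...</div>', '<div>...</div>'],
})
crits_demo = pd.DataFrame({
    'critic_country': [' (Argentina)', ' (Australia)'],  # still raw
    'properties.headline': [2, 4],
})

# Clean the key on the right-hand frame so both sides use the same spelling
crits_demo['critic_country'] = clean_country(crits_demo['critic_country'])

merged = output_demo.merge(crits_demo, how='left', on='critic_country')
merged['properties.headline'] = merged['properties.headline'].map(str) + ' critics'
print(merged[['critic_country', 'properties.headline']])
```

An alternative with the same effect would be to clean `critic_country` once on the original `df`, before any of the `groupby` calls, so every derived frame shares the same key.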


In [225]:
#Add some color
output['properties.color'] = "#000066"
output

Unnamed: 0,critic_country,properties.article,properties.headline,properties.color
0,Argentina,<div id='movie'><P><b>Movie:</b> Spirited Away...,nan critic_name,#000066
1,Australia,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name,#000066
2,Austria,<div id='movie'><P><b>Movie:</b> Platform : 2 ...,nan critic_name,#000066
3,Bangladesh,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name,#000066
4,Belgium,<div id='movie'><P><b>Movie:</b> Amores Perros...,nan critic_name,#000066
5,Brazil,<div id='movie'><P><b>Movie:</b> Birdman : 1 v...,nan critic_name,#000066
6,Canada,<div id='movie'><P><b>Movie:</b> In the Mood f...,nan critic_name,#000066
7,Chile,<div id='movie'><P><b>Movie:</b> A History of ...,nan critic_name,#000066
8,China,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name,#000066
9,Colombia,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name,#000066


In [226]:
#To get an alphabetical list of the unique directors
d_list = sorted(df['movie_dir'].unique())
d_list

['Abbas Kiarostami',
 'Abdellatif Kechiche',
 'Abderrahmane Sissako',
 'Adam Curtis',
 'Adam McKay',
 'Agnieszka Holland',
 'Agnès Jaoui',
 'Agnès Varda',
 'Aki Kaurismäki',
 'Alain Cavalier',
 'Alain Gomis',
 'Alain Guiraudie',
 'Alain Resnais',
 'Albert Serra',
 'Alejandro González Iñárritu',
 'Aleksandr Sokurov',
 'Aleksey Fedorchenko',
 'Aleksey German',
 'Alex Garland',
 'Alexander Payne',
 'Alfonso Cuarón',
 'Amma Asante',
 'Ana Lily Amirpour',
 'Andrea Arnold',
 'Andrew Adamson and Vicky Jenson',
 'Andrew Dominik',
 'Andrew Dosunmu',
 'Andrew Haigh',
 'Andrew Lau and Alan Mak',
 'Andrew Stanton',
 'Andrew Stanton and Lee Unkrich',
 'Andrey Zvyagintsev',
 'Andrzej Wajda',
 'Andrzej Zulawski',
 'André Singer',
 'Ang Lee',
 'Annemarie Jacir',
 'Anthony and Joe Russo',
 'Anurag Kashyap',
 'Apichatpong Weerasethakul',
 'Ari Folman',
 'Arnaud Desplechin',
 'Asghar Farhadi',
 'Ashutosh Gowariker',
 'Asif Kapadia',
 'Ava DuVernay',
 'Avi Nesher',
 'Bahman Ghobadi',
 'Bart Layton',
 'Baz

In [227]:
#Some nice imports
import requests
import json
import numpy as np
import pandas as pd
from pandas import json_normalize


In [228]:

##Load the GeoJSON file exported from Mapshaper

with open("C:/Users/kanin/OneDrive/Desktop/jsonpath/other_countries.json") as json_data:
    geometry_data = json.load(json_data)


In [229]:
##Normalize the hierarchy so you have simple rows in a dataframe
##Note that you need to extract it from geometry_data['features']
##(json_normalize already returns a DataFrame; this reuses the name df, replacing the movie table)
df = json_normalize(geometry_data['features'])


In [230]:
df.head()

Unnamed: 0,type,properties.scalerank,properties.featurecla,properties.labelrank,properties.sovereignt,properties.sov_a3,properties.adm0_dif,properties.level,properties.type,properties.admin,...,properties.subregion,properties.region_wb,properties.name_len,properties.long_len,properties.abbrev_len,properties.tiny,properties.homepart,properties.filename,geometry.type,geometry.coordinates
0,Feature,1,Admin-0 country,6,Belize,BLZ,0,2,Sovereign country,Belize,...,Central America,Latin America & Caribbean,6,6,6,-99,1,BLZ.geojson,Polygon,"[[[-89.14308041050332, 17.80831899664932], [-8..."
1,Feature,1,Admin-0 country,4,The Bahamas,BHS,0,2,Sovereign country,The Bahamas,...,Caribbean,Latin America & Caribbean,7,7,4,-99,1,BHS.geojson,MultiPolygon,"[[[[-77.53466, 23.75975], [-77.78, 23.71], [-7..."
2,Feature,1,Admin-0 country,5,Dominican Republic,DOM,0,2,Sovereign country,Dominican Republic,...,Caribbean,Latin America & Caribbean,14,18,9,-99,1,DOM.geojson,Polygon,"[[[-71.71236141629296, 19.714455878167357], [-..."
3,Feature,1,Admin-0 country,2,Canada,CAN,0,2,Sovereign country,Canada,...,Northern America,North America,6,6,4,-99,1,CAN.geojson,MultiPolygon,"[[[[-63.6645, 46.55001], [-62.9393, 46.41587],..."
4,Feature,1,Admin-0 country,3,Cuba,CUB,0,2,Sovereign country,Cuba,...,Caribbean,Latin America & Caribbean,4,4,4,-99,1,CUB.geojson,Polygon,"[[[-82.26815121125706, 23.188610744717703], [-..."


In [231]:
finaldf = output.merge(df, how='left', left_on ='critic_country', right_on='properties.sovereignt')

In [232]:
finaldf

Unnamed: 0,critic_country,properties.article,properties.headline,properties.color,type,properties.scalerank,properties.featurecla,properties.labelrank,properties.sovereignt,properties.sov_a3,...,properties.subregion,properties.region_wb,properties.name_len,properties.long_len,properties.abbrev_len,properties.tiny,properties.homepart,properties.filename,geometry.type,geometry.coordinates
0,Argentina,<div id='movie'><P><b>Movie:</b> Spirited Away...,nan critic_name,#000066,Feature,1.0,Admin-0 country,2.0,Argentina,ARG,...,South America,Latin America & Caribbean,9.0,9.0,4.0,-99.0,1.0,ARG.geojson,MultiPolygon,"[[[[-65.5, -55.2], [-66.45, -55.25], [-66.9599..."
1,Australia,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name,#000066,Feature,1.0,Admin-0 country,2.0,Australia,AU1,...,Australia and New Zealand,East Asia & Pacific,9.0,9.0,4.0,-99.0,1.0,AUS.geojson,MultiPolygon,"[[[[145.39797814349484, -40.79254851660589], [..."
2,Austria,<div id='movie'><P><b>Movie:</b> Platform : 2 ...,nan critic_name,#000066,Feature,1.0,Admin-0 country,4.0,Austria,AUT,...,Western Europe,Europe & Central Asia,7.0,7.0,5.0,-99.0,1.0,AUT.geojson,Polygon,"[[[16.979666782304037, 48.123497015976305], [1..."
3,Bangladesh,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name,#000066,Feature,1.0,Admin-0 country,3.0,Bangladesh,BGD,...,Southern Asia,South Asia,10.0,10.0,5.0,-99.0,1.0,BGD.geojson,Polygon,"[[[92.67272098182556, 22.041238918541254], [92..."
4,Belgium,<div id='movie'><P><b>Movie:</b> Amores Perros...,nan critic_name,#000066,Feature,1.0,Admin-0 country,2.0,Belgium,BEL,...,Western Europe,Europe & Central Asia,7.0,7.0,5.0,-99.0,1.0,BEL.geojson,Polygon,"[[[3.314971144228537, 51.345780951536085], [4...."
5,Brazil,<div id='movie'><P><b>Movie:</b> Birdman : 1 v...,nan critic_name,#000066,Feature,1.0,Admin-0 country,2.0,Brazil,BRA,...,South America,Latin America & Caribbean,6.0,6.0,6.0,-99.0,1.0,BRA.geojson,Polygon,"[[[-57.62513342958296, -30.216294854454258], [..."
6,Canada,<div id='movie'><P><b>Movie:</b> In the Mood f...,nan critic_name,#000066,Feature,1.0,Admin-0 country,2.0,Canada,CAN,...,Northern America,North America,6.0,6.0,4.0,-99.0,1.0,CAN.geojson,MultiPolygon,"[[[[-63.6645, 46.55001], [-62.9393, 46.41587],..."
7,Chile,<div id='movie'><P><b>Movie:</b> A History of ...,nan critic_name,#000066,Feature,1.0,Admin-0 country,2.0,Chile,CHL,...,South America,Latin America & Caribbean,5.0,5.0,5.0,-99.0,1.0,CHL.geojson,MultiPolygon,"[[[[-68.63401022758316, -52.63637045887437], [..."
8,China,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name,#000066,Feature,1.0,Admin-0 country,2.0,China,CH1,...,Eastern Asia,East Asia & Pacific,5.0,5.0,5.0,-99.0,1.0,CHN.geojson,MultiPolygon,"[[[[110.33918786015154, 18.678395087147607], [..."
9,Colombia,<div id='movie'><P><b>Movie:</b> A Separation ...,nan critic_name,#000066,Feature,1.0,Admin-0 country,2.0,Colombia,COL,...,South America,Latin America & Caribbean,8.0,8.0,4.0,-99.0,1.0,COL.geojson,Polygon,"[[[-75.37322323271385, -0.15203175212045], [-7..."
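
Because this is a left merge, any country whose name in `critic_country` does not appear in `properties.sovereignt` keeps its row but gets empty geometry columns. Short labels such as `US` or `UK` may well be spelled differently in the map file. Here is a minimal sketch of one way to list the unmatched countries, using the `indicator` option of `merge`. The two frames and the spelling `United States of America` are illustrative assumptions, not values read from the actual file.

```python
import pandas as pd

# Stand-ins for `output` (left) and the normalized map data (right)
left = pd.DataFrame({'critic_country': ['Argentina', 'US']})
right = pd.DataFrame({
    'properties.sovereignt': ['Argentina', 'United States of America'],
    'geometry.type': ['MultiPolygon', 'MultiPolygon'],
})

# indicator=True adds a `_merge` column saying where each row came from
check = left.merge(right, how='left', left_on='critic_country',
                   right_on='properties.sovereignt', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'critic_country'].tolist()
print(unmatched)

# One possible remedy: map the short labels to the map file's spelling first
name_fixes = {'US': 'United States of America'}  # hypothetical mapping
left['critic_country'] = left['critic_country'].replace(name_fixes)
fixed = left.merge(right, how='left', left_on='critic_country',
                   right_on='properties.sovereignt')
```

Running the check against the real frames would show which country names, if any, need an entry in a mapping like `name_fixes`.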


In [234]:
ok_json = json.loads(finaldf.to_json(orient='records'))


In [235]:
ok_json

[{'critic_country': 'Argentina',
  'properties.article': "<div id='movie'><P><b>Movie:</b> Spirited Away : 2 votes</p><p> <b>Movie:</b> Adventureland : 1 vote</p><p> <b>Movie:</b> Boyhood : 1 vote</p><p> <b>Movie:</b> Elephant : 1 vote</p><p> <b>Movie:</b> Extraordinary Stories : 1 vote</p><p> <b>Movie:</b> In the Mood for Love : 1 vote</p><p> <b>Movie:</b> Jersey Boys : 1 vote</p><p> <b>Movie:</b> Mad Max: Fury Road : 1 vote</p><p> <b>Movie:</b> Mia Madre : 1 vote</p><p> <b>Movie:</b> Moulin Rouge! : 1 vote</p><p> <b>Movie:</b> Mulholland Drive : 1 vote</p><p> <b>Movie:</b> Nine Queens : 1 vote</p><p> <b>Movie:</b> Open Range : 1 vote</p><p> <b>Movie:</b> Right Now, Wrong Then : 1 vote</p><p> <b>Movie:</b> The Social Network : 1 vote</p><p> <b>Movie:</b> The Son's Room : 1 vote</p><p> <b>Movie:</b> Toy Story 3 : 1 vote</p><p> <b>Movie:</b> Uncle Boonmee Who Can Recall His Past Lives : 1 vote</p><p> <b>Movie:</b> WALL-E : 1 vote</P></div>",
  'properties.headline': 'nan critic_name',
 

In [237]:
def process_to_geojson(records):
    """Turn flat records with dotted keys into a GeoJSON FeatureCollection."""
    geo_data = {"type": "FeatureCollection", "features": []}
    for row in records:
        this_dict = {"type": "Feature", "properties": {}, "geometry": {}}
        for key, value in row.items():
            # 'properties.color' -> ['properties', 'color']
            key_names = key.split('.')
            if key_names[0] == 'geometry':
                this_dict['geometry'][key_names[1]] = value
            elif key_names[0] == 'properties':
                this_dict['properties'][key_names[1]] = value
        geo_data['features'].append(this_dict)
    return geo_data

In [238]:
geo_format = process_to_geojson(ok_json)


In [239]:
geo_format

{'type': 'FeatureCollection',
 'features': [{'type': 'Feature',
   'properties': {'article': "<div id='movie'><P><b>Movie:</b> Spirited Away : 2 votes</p><p> <b>Movie:</b> Adventureland : 1 vote</p><p> <b>Movie:</b> Boyhood : 1 vote</p><p> <b>Movie:</b> Elephant : 1 vote</p><p> <b>Movie:</b> Extraordinary Stories : 1 vote</p><p> <b>Movie:</b> In the Mood for Love : 1 vote</p><p> <b>Movie:</b> Jersey Boys : 1 vote</p><p> <b>Movie:</b> Mad Max: Fury Road : 1 vote</p><p> <b>Movie:</b> Mia Madre : 1 vote</p><p> <b>Movie:</b> Moulin Rouge! : 1 vote</p><p> <b>Movie:</b> Mulholland Drive : 1 vote</p><p> <b>Movie:</b> Nine Queens : 1 vote</p><p> <b>Movie:</b> Open Range : 1 vote</p><p> <b>Movie:</b> Right Now, Wrong Then : 1 vote</p><p> <b>Movie:</b> The Social Network : 1 vote</p><p> <b>Movie:</b> The Son's Room : 1 vote</p><p> <b>Movie:</b> Toy Story 3 : 1 vote</p><p> <b>Movie:</b> Uncle Boonmee Who Can Recall His Past Lives : 1 vote</p><p> <b>Movie:</b> WALL-E : 1 vote</P></div>",
    'head
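
To see what the conversion does with the dotted column names, it can help to run it on a single small record. The sketch below repeats the function so that it runs on its own; the sample record, including its `Point` geometry and coordinates, is invented for illustration.

```python
def process_to_geojson(records):
    """Turn flat records with dotted keys into a GeoJSON FeatureCollection."""
    geo_data = {"type": "FeatureCollection", "features": []}
    for row in records:
        this_dict = {"type": "Feature", "properties": {}, "geometry": {}}
        for key, value in row.items():
            key_names = key.split('.')
            if key_names[0] == 'geometry':
                this_dict['geometry'][key_names[1]] = value
            elif key_names[0] == 'properties':
                this_dict['properties'][key_names[1]] = value
        geo_data['features'].append(this_dict)
    return geo_data

# Hypothetical flat record, shaped like one row of the merged table
sample = [{
    'critic_country': 'Argentina',
    'properties.headline': '2 critics',
    'properties.color': '#000066',
    'geometry.type': 'Point',
    'geometry.coordinates': [-64.0, -34.0],
}]

result = process_to_geojson(sample)
feature = result['features'][0]
print(feature['properties'])
print(feature['geometry'])
```

Note that keys without a `properties.` or `geometry.` prefix, such as `critic_country`, are not copied into the feature. If you want the country name available on the map, one option is to store it under a `properties.`-prefixed column before converting.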

In [242]:
#Write the variable name, then the geojson output, to the same .js file
with open('geo-data12-8.js', 'w') as outfile:
    outfile.write("var infoData = ")
    json.dump(geo_format, outfile)
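
A quick way to confirm the file came out as intended is to read it back, strip the `var infoData = ` prefix, and parse the rest as JSON. This is a minimal sketch that writes to a temporary folder so it does not touch your real output; `geo_demo` is an empty stand-in for `geo_format`, and the file name is made up.

```python
import json
import os
import tempfile

geo_demo = {"type": "FeatureCollection", "features": []}  # stand-in for geo_format
prefix = "var infoData = "

# Write the prefix and the JSON body to a throwaway file
path = os.path.join(tempfile.mkdtemp(), "geo-data-demo.js")
with open(path, "w") as outfile:
    outfile.write(prefix)
    json.dump(geo_demo, outfile)

# Read it back and check that everything after the prefix is valid JSON
with open(path) as infile:
    text = infile.read()

round_trip = json.loads(text[len(prefix):])
print(round_trip == geo_demo)
```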
