> ## Disney Dataset Creation (w/ Python BeautifulSoup)


The idea is to scrape the Wikipedia list of Disney movies across the decades and build a dataset with information on each of these movies.

***
### Task #1: Get Info Box (store in Python dictionary)
***

#### Importing Necessary Libraries

In [1]:
from bs4 import BeautifulSoup as bs
import requests

#### Loading the webpage

In [2]:
r = requests.get("https://en.wikipedia.org/wiki/Toy_Story_3")

# Convert it to a BeautifulSoup object (naming the parser explicitly avoids bs4's parser-guessing warning)
soup = bs(r.content, "html.parser")

In [3]:
info_box = soup.find(class_ = "infobox vevent") # HTML table class that apparently contains all the information we are looking for
info_rows = info_box.find_all("tr")

for row in info_rows:
    print(row.prettify())

<tr>
 <th class="infobox-above summary" colspan="2" style="font-size: 125%; font-style: italic;">
  Toy Story 3
 </th>
</tr>

<tr>
 <td class="infobox-image" colspan="2">
  <a class="image" href="/wiki/File:Toy_Story_3_poster.jpg" title="All of the toys packed close together, holding up a large numeral '3', with Buzz, who is putting a friendly arm around Woody's shoulder, and Woody holding the top of the 3.">
   <img alt="All of the toys packed close together, holding up a large numeral '3', with Buzz, who is putting a friendly arm around Woody's shoulder, and Woody holding the top of the 3." class="thumbborder" data-file-height="326" data-file-width="220" decoding="async" height="326" src="//upload.wikimedia.org/wikipedia/en/6/69/Toy_Story_3_poster.jpg" width="220"/>
  </a>
  <div class="infobox-caption">
   Theatrical release poster
  </div>
 </td>
</tr>

<tr>
 <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;">
  Directed by
 </th>
 <td class="

In [4]:
movie_info = {}

# Some infobox cells hold a list of values rather than a single string, so we handle both cases
def get_content_value(row_data):
    if row_data.find("li"): # 'li' is the HTML tag for a list item
        # Collect the text of every list item into a Python list
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ") # Not a list, so store the text as a single string

    
for index, row in enumerate(info_rows):
    if index == 0:
        movie_info["title"] = row.find("th").get_text(" ", strip=True) # 'th' is a table header cell
    elif index == 1:
        continue # we're ignoring the picture from the info box
    else:
        content_key = row.find("th").get_text(" ", strip=True)
        content_value = get_content_value(row.find("td")) # 'td' is a table data cell
        movie_info[content_key] = content_value
        

movie_info

{'title': 'Toy Story 3',
 'Directed by': 'Lee Unkrich',
 'Screenplay by': 'Michael Arndt',
 'Story by': ['John Lasseter', 'Andrew Stanton', 'Lee Unkrich'],
 'Produced by': 'Darla K. Anderson',
 'Starring': ['Tom Hanks',
  'Tim Allen',
  'Joan Cusack',
  'Don Rickles',
  'Wallace Shawn',
  'John Ratzenberger',
  'Estelle Harris',
  'Ned Beatty',
  'Michael Keaton',
  'Jodi Benson',
  'John Morris'],
 'Cinematography': ['Jeremy Lasky', 'Kim White'],
 'Edited by': 'Ken Schretzmann',
 'Music by': 'Randy Newman',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release dates': ['June 12, 2010 ( 2010-06-12 ) ( Taormina Film Fest )',
  'June 18, 2010 ( 2010-06-18 ) (United States)'],
 'Running time': '103 minutes [1]',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$200 million [1]',
 'Box office': '$1.067 billion [1]'}

***
### Task #2: Get Info Box for all Movies
***

Our goal is to have a list of dictionaries, each dictionary representing the Wikipedia info box for one movie.

In [5]:
request_all_movies = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")
soup_all_movies = bs(request_all_movies.content, "html.parser")

In [6]:
movies = soup_all_movies.select(".wikitable.sortable i a")
movies

[<a href="/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)" title="Snow White and the Seven Dwarfs (1937 film)">Snow White and the Seven Dwarfs</a>,
 <a href="/wiki/Pinocchio_(1940_film)" title="Pinocchio (1940 film)">Pinocchio</a>,
 <a href="/wiki/Fantasia_(1940_film)" title="Fantasia (1940 film)">Fantasia</a>,
 <a href="/wiki/The_Reluctant_Dragon_(1941_film)" title="The Reluctant Dragon (1941 film)">The Reluctant Dragon</a>,
 <a href="/wiki/Dumbo" title="Dumbo">Dumbo</a>,
 <a href="/wiki/Bambi" title="Bambi">Bambi</a>,
 <a href="/wiki/Saludos_Amigos" title="Saludos Amigos">Saludos Amigos</a>,
 <a href="/wiki/Victory_Through_Air_Power_(film)" title="Victory Through Air Power (film)">Victory Through Air Power</a>,
 <a href="/wiki/The_Three_Caballeros" title="The Three Caballeros">The Three Caballeros</a>,
 <a href="/wiki/Make_Mine_Music" title="Make Mine Music">Make Mine Music</a>,
 <a href="/wiki/Song_of_the_South" title="Song of the South">Song of the South</a>,
 <a href="/wiki/Fun_

In [7]:
# We will reuse the previous code
def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    elif row_data.find("br"): # Solving: Split up long strings into lists
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ") 

def clean_tags(soup): # Drop 'sup' tags (reference markers such as [1]) and 'span' tags from the tree
    for tag in soup.find_all(["sup", "span"]):
        tag.decompose()
    
def get_info_box(url):    
    request = requests.get(url)
    soup = bs(request.content, "html.parser")
    info_box = soup.find(class_ = "infobox vevent")
    info_rows = info_box.find_all("tr")
 
    clean_tags(soup)
    
    movie_info = {} 
    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info["title"] = row.find("th").get_text(" ", strip=True) 
        elif index == 1:
            continue 
        else:
            header = row.find("th")
            if header: # skip rows that have no header cell
                content_key = header.get_text(" ", strip=True)
                content_value = get_content_value(row.find("td"))
                movie_info[content_key] = content_value

    return movie_info

In [8]:
get_info_box("https://en.wikipedia.org/wiki/Davy_Crockett_and_the_River_Pirates")

{'title': 'Davy Crockett and the River Pirates',
 'Directed by': 'Norman Foster',
 'Written by': ['Tom Blackburn', 'Norman Foster'],
 'Produced by': 'Bill Walsh',
 'Starring': ['Fess Parker', 'Buddy Ebsen', 'Jeff York'],
 'Cinematography': 'Bert Glennon',
 'Edited by': 'Stanley Johnson',
 'Music by': ['Thomas W. Blackburn (lyrics)',
  'George Bruns',
  'Edward H. Plumb (orchestration)'],
 'Color process': 'Technicolor',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Film Distribution Co., Inc.',
 'Release date': ['July 18, 1956'],
 'Running time': '81 minutes',
 'Country': 'United States',
 'Language': 'English'}

In [9]:
# Careful when running this loop: it requests every movie page and may take a while.
# A JSON file containing the scraped information should already be available.

#movie_info_list = []
#base_path = "https://en.wikipedia.org"

#for index, movie in enumerate(movies):
#   try:
#       relative_path = movie['href']
#       full_path = base_path + relative_path
#       title = movie['title']
#       
#       movie_info_list.append(get_info_box(full_path))
#   except Exception as e:
#       print(movie.get_text())
#       print(e)
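
The `href` values collected in `movies` are site-relative paths such as `/wiki/Dumbo`, so each one has to be joined to the site root before it can be requested. A minimal sketch using the standard library's `urllib.parse.urljoin`, which yields a clean URL whether or not the root ends in a slash. `to_full_url` and `WIKIPEDIA_ROOT` are hypothetical names, not part of the notebook.

```python
from urllib.parse import urljoin

WIKIPEDIA_ROOT = "https://en.wikipedia.org/"  # hypothetical constant for the site root


def to_full_url(relative_path, root=WIKIPEDIA_ROOT):
    """Join a site-relative href (e.g. '/wiki/Dumbo') to the site root."""
    return urljoin(root, relative_path)


print(to_full_url("/wiki/Dumbo"))
```

Inside the scraping loop, `full_path = to_full_url(movie['href'])` could then replace the manual string concatenation.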

#### Save/Reload Movie Data
Scraping all of this information takes a lot of time, so we'll save it to a JSON file to be able to access it more easily.

In [10]:
import json

def save_data(title, data):
    with open(title, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

In [11]:
def load_data(title):
    with open(title, encoding="utf-8") as f:
        return json.load(f)
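
As a quick check that saving and reloading is lossless, the two helpers can be exercised on a small sample. This is only a sketch: the two records are hypothetical stand-ins for `movie_info_list`, and the file is written to a temporary directory so nothing is left behind.

```python
import json
import os
import tempfile


def save_data(title, data):
    with open(title, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_data(title):
    with open(title, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical two-record sample standing in for movie_info_list
sample = [
    {"title": "Black Is King", "Directed by": ["Beyoncé Knowles-Carter"]},
    {"title": "Dumbo", "Running time": "64 minutes"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    save_data(path, sample)
    reloaded = load_data(path)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

print(reloaded == sample)     # the round trip preserves the data
print("Beyoncé" in raw_text)  # ensure_ascii=False keeps accented characters readable in the file
```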

In [12]:
# save_data("disney_data.json", movie_info_list)

***
### Task #3: Clean Up Our Data!
***

In [13]:
movie_info_list = load_data("disney_data.json")

#### Subtasks:

  * Clean up references (remove '[ ]');
  * Convert running time into an integer;
  * Convert date into datetime object;
  * Split up long strings into lists;
  * Convert Budget & Box office to numbers;
  * Try to fix some of the errors we got when scraping all the Wikipedia pages.
  
We should probably go back to the code that extracts the info box from Wikipedia to solve these issues.

>  #### Clean up references (remove [1], [2], etc.)

Every reference of this kind is wrapped in an HTML `sup` tag, so we find all of these tags and remove them. To do that, we went back to the previous code and added `clean_tags()`, which removes the selected tags (and their content) from the parsed page.
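
If re-scraping is not an option, reference markers can also be stripped from strings that were already saved. The sketch below shows that alternative, a regex pass over stored values rather than the tag-removal approach; `strip_references` is a hypothetical helper, and the sample values are taken from the Toy Story 3 output shown earlier.

```python
import re

# Matches markers such as " [1]" or "[ 12 ]", including the whitespace before them
REFERENCE_MARKER = re.compile(r"\s*\[\s*\d+\s*\]")


def strip_references(value):
    """Remove bracketed reference markers from a string or a list of strings."""
    if isinstance(value, list):
        return [strip_references(item) for item in value]
    if isinstance(value, str):
        return REFERENCE_MARKER.sub("", value).strip()
    return value  # leave numbers, None, etc. untouched


print(strip_references("103 minutes [1]"))
print(strip_references(["$200 million [1]", "$1.067 billion [1]"]))
```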

>  #### Split up long strings into lists

This happens because the HTML does not use `li` tags for this list-like content; it separates the items with `br` tags instead. We therefore added an `elif` branch to `get_content_value()` that detects a `br` tag and returns the cell's text fragments (via `stripped_strings`) as a list, which `get_info_box()` then stores as the value.

>  #### Try to fix some of the errors we got when scraping all the Wikipedia pages

Some of these errors occurred because a row in the movie page's **Info Box** didn't have a `th` tag. Our code reads the **table header** cell to get the key for each row, so it failed on those rows. We simply added `if header` to `get_info_box()`, so that a row is only processed when it has a `th` tag.

Other errors occurred because the movie's Wikipedia page didn't have an info box at all, so we just skip those movies.

### At this point we should re-run our scraping and save the result to a new JSON file


In [14]:
movie_info_list = load_data("dataset_checkpoints/disney_data_cleaned.json")

>  #### Convert running time into an integer


In [15]:
[movie.get('Running time', 'N/A') for movie in movie_info_list]

['41 minutes (74 minutes 1966 release)',
 '83 minutes',
 '88 minutes',
 '126 minutes',
 '74 minutes',
 '64 minutes',
 '70 minutes',
 '42 minutes',
 '65 min.',
 '71 minutes',
 '75 minutes',
 '94 minutes',
 '73 minutes',
 '75 minutes',
 '82 minutes',
 '68 minutes',
 '74 minutes',
 '96 minutes',
 '75 minutes',
 '84 minutes',
 '77 minutes',
 '92 minutes',
 '69 minutes',
 '81 minutes',
 ['60 minutes (VHS version)', '71 minutes (original)'],
 '127 minutes',
 '92 minutes',
 '76 minutes',
 '75 minutes',
 '73 minutes',
 '85 minutes',
 '81 minutes',
 '70 minutes',
 '90 min.',
 '80 minutes',
 '75 minutes',
 '83 minutes',
 '83 minutes',
 '72 minutes',
 '97 minutes',
 '75 minutes',
 '104 minutes',
 '93 minutes',
 '105 minutes',
 '95 minutes',
 '97 minutes',
 '134 minutes',
 '69 minutes',
 '92 minutes',
 '126 minutes',
 '79 minutes',
 '97 minutes',
 '128 minutes',
 '74 minutes',
 '91 minutes',
 '105 minutes',
 '98 minutes',
 '130 minutes',
 '89 min.',
 '93 minutes',
 '67 minutes',
 '98 minutes',
 '1

In [16]:
def minutes_to_integer(running_time):
    if running_time == "N/A":
        return None
    if isinstance(running_time, list):
        first_entry = running_time[0]
        return int(first_entry.split(" ")[0])
    else:
        return int(running_time.split(" ")[0])

for movie in movie_info_list:
    movie['Running Time (minutes)'] = minutes_to_integer(movie.get('Running time', 'N/A'))
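
The split-based conversion relies on every string starting with the number of minutes. A slightly more defensive sketch, assuming we only want the first integer that appears, uses a regex and returns `None` when no number is found. `parse_running_time` is a hypothetical name, and the sample inputs come from the list printed above.

```python
import re


def parse_running_time(running_time):
    """Return the first integer found in a running-time value, or None."""
    if isinstance(running_time, list):
        running_time = running_time[0] if running_time else None
    if not isinstance(running_time, str):
        return None
    match = re.search(r"\d+", running_time)
    return int(match.group()) if match else None


for raw in ["41 minutes (74 minutes 1966 release)", "65 min.",
            ["60 minutes (VHS version)", "71 minutes (original)"], "N/A"]:
    print(raw, "->", parse_running_time(raw))
```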

In [17]:
movie_info_list[-10]

{'title': 'Black Is King',
 'Directed by': ['Beyoncé Knowles-Carter',
  'Kwasi Fordjour',
  'Emmanuel Adjei',
  'Blitz Bazawule',
  'Ibra Ake',
  'Jenn Nkiru',
  'Jake Nava',
  'Pierre Debusschere',
  'Dikayl Rimmasch'],
 'Produced by': ['Jeremy Sullivan',
  'Jimi Adesanya',
  'Blitz Bazawule',
  'Ben Cooper',
  'Astrid Edwards',
  'Durwin Julies',
  'Yoli Mes',
  'Dafe Oboro',
  'Akin Omotoso',
  'Will Whitney',
  'Lauren Baker',
  'Jason Baum',
  'Alex Chamberlain',
  'Robert Day',
  'Christophe Faubert',
  'Brien Justiniano',
  'Rethabile Molatela Mothobi',
  'Sylvia Zakhary',
  'Nathan Scherrer',
  'Erinn Williams'],
 'Written by': ['Beyoncé Knowles-Carter',
  'Yrsa Daley-Ward',
  'Clover Hope',
  'Andrew Morrow'],
 'Based on': ['The Lion King: The Gift'],
 'Starring': ['Beyoncé',
  'Folajomi Akinmurele',
  'Connie Chiume',
  'Nyaniso Ntsikelelo Dzedze',
  'Nandi Madida',
  'Warren Masemola',
  'Sibusiso Mbeje',
  'Fumi Odede',
  'Stephen Ojo',
  'Mary Twala'],
 'Music by': ['James

In [18]:
[movie.get('Running Time (minutes)', 'N/A') for movie in movie_info_list]

[41,
 83,
 88,
 126,
 74,
 64,
 70,
 42,
 65,
 71,
 75,
 94,
 73,
 75,
 82,
 68,
 74,
 96,
 75,
 84,
 77,
 92,
 69,
 81,
 60,
 127,
 92,
 76,
 75,
 73,
 85,
 81,
 70,
 90,
 80,
 75,
 83,
 83,
 72,
 97,
 75,
 104,
 93,
 105,
 95,
 97,
 134,
 69,
 92,
 126,
 79,
 97,
 128,
 74,
 91,
 105,
 98,
 130,
 89,
 93,
 67,
 98,
 100,
 118,
 103,
 110,
 80,
 79,
 91,
 91,
 97,
 118,
 139,
 92,
 131,
 87,
 116,
 93,
 110,
 110,
 131,
 101,
 108,
 84,
 78,
 75,
 164,
 106,
 110,
 99,
 113,
 108,
 112,
 93,
 91,
 93,
 100,
 100,
 79,
 96,
 113,
 89,
 118,
 92,
 88,
 92,
 87,
 93,
 93,
 93,
 90,
 83,
 96,
 88,
 89,
 91,
 93,
 92,
 97,
 100,
 100,
 89,
 91,
 112,
 115,
 95,
 91,
 95,
 104,
 74,
 48,
 77,
 104,
 128,
 101,
 94,
 104,
 90,
 100,
 88,
 93,
 98,
 100,
 112,
 84,
 98,
 97,
 114,
 96,
 100,
 109,
 83,
 90,
 107,
 96,
 103,
 91,
 95,
 105,
 113,
 80,
 101,
 89,
 74,
 90,
 89,
 110,
 74,
 93,
 84,
 83,
 69,
 77,
 107,
 93,
 88,
 108,
 84,
 121,
 89,
 104,
 90,
 86,
 84,
 108,
 107,
 96,
 98,
 

>  #### Convert Budget & Box office to numbers

In [19]:
[movie.get('Budget', 'N/A') for movie in movie_info_list]

['N/A',
 '$1.49 million',
 '$2.6 million',
 '$2.28 million',
 '$600,000',
 '$950,000',
 '$858,000',
 'N/A',
 '$788,000',
 'N/A',
 '$1.35 million',
 '$2.125 million',
 'N/A',
 '$1.5 million',
 '$1.5 million',
 'N/A',
 '$2.9 million',
 '$1,800,000',
 '$3 million',
 'N/A',
 '$4 million',
 '$2 million',
 '$300,000',
 '$1.8 million',
 'N/A',
 '$5 million',
 'N/A',
 '$4 million',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 '$700,000',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 '$6 million',
 'under $1 million or $1,250,000',
 'N/A',
 '$2 million',
 'N/A',
 'N/A',
 '$2.5 million',
 'N/A',
 'N/A',
 '$4 million',
 '$3.6 million',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 '$3 million',
 'N/A',
 '$3 million',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 '$3 million',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 '$4.4–6 million',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 '$4 million',
 'N/A',
 '$5 million',
 'N/A',
 'N/A',
 'N/A',
 'N/A',
 '

In [20]:
import re

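amounts = r"thousand|million|billion" # magnitude words
number = r"\d+(,\d*)*\.*\d*" # digits with optional thousands separators and decimals

# Word syntax: a dollar amount (or range) followed by a magnitude word, e.g. "$4.4–6 million"
word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
# Value syntax: a plain dollar amount, e.g. "$1,800,000"
value_re = rf"\${number}"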

In [21]:
def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

def parse_word_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value
    
def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    return value

def money_conversion(money):
    
    if money == "N/A":
        return None
    
    if isinstance(money, list):
        money = money[0]
    
    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)
    
    if word_syntax:
        return parse_word_syntax(word_syntax.group())
        
    elif value_syntax:
        return parse_value_syntax(value_syntax.group())
    
    else:
        return None

In [22]:
for movie in movie_info_list:
    movie['Budget (float)'] = money_conversion(movie.get("Budget", "N/A"))
    movie['Box office (float)'] = money_conversion(movie.get("Box office", "N/A"))
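
To see which branch of `money_conversion()` each format takes, the two patterns can be tried on a few of the budget strings listed above. This is a small sketch; `matched_syntax` is a hypothetical helper that only reports the branch and the matched text. Because `parse_word_syntax()` reads just the first number in the matched text, a range such as `$4.4–6 million` resolves to its lower bound.

```python
import re

amounts = r"thousand|million|billion"
number = r"\d+(,\d*)*\.*\d*"
word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
value_re = rf"\${number}"


def matched_syntax(money):
    """Return (branch_name, matched_text) for the branch money_conversion would take."""
    word_syntax = re.search(word_re, money, flags=re.I)
    if word_syntax:
        return ("word", word_syntax.group())
    value_syntax = re.search(value_re, money)
    if value_syntax:
        return ("value", value_syntax.group())
    return None


samples = ["$1.49 million", "$4.4–6 million", "$1,800,000",
           "under $1 million or $1,250,000"]
for sample in samples:
    print(sample, "->", matched_syntax(sample))
```

Note that for a string holding two amounts, the word syntax is checked first, so `under $1 million or $1,250,000` is read as `$1 million`.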

In [23]:
movie_info_list[-40]

{'title': 'Beauty and the Beast',
 'Directed by': 'Bill Condon',
 'Produced by': ['David Hoberman', 'Todd Lieberman'],
 'Screenplay by': ['Stephen Chbosky', 'Evan Spiliotopoulos'],
 'Based on': ["Disney 's Beauty and the Beast by Linda Woolverton",
  'Beauty and the Beast by Jeanne-Marie Leprince de Beaumont'],
 'Starring': ['Emma Watson',
  'Dan Stevens',
  'Luke Evans',
  'Kevin Kline',
  'Josh Gad',
  'Ewan McGregor',
  'Stanley Tucci',
  'Audra McDonald',
  'Gugu Mbatha-Raw',
  'Ian McKellen',
  'Emma Thompson'],
 'Music by': 'Alan Menken',
 'Cinematography': 'Tobias A. Schliessler',
 'Edited by': 'Virginia Katz',
 'Production company': ['Walt Disney Pictures', 'Mandeville Films'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release date': ['February 23, 2017 ( Spencer House )',
  'March 17, 2017 (United States)'],
 'Running time': '129 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$160–255 million',
 'Box office': '$1.264 billion',

> #### Convert dates into datetime objects

In [24]:
[movie.get('Release date', 'N/A') for movie in movie_info_list]

[['May 19, 1937'],
 ['December 21, 1937 ( Carthay Circle Theatre , Los Angeles , CA )',
  'February 4, 1938 (United States)'],
 ['February 7, 1940 ( Center Theatre )', 'February 23, 1940 (United States)'],
 ['November 13, 1940'],
 ['June 20, 1941'],
 ['October 23, 1941 (New York City)', 'October 31, 1941 (U.S.)'],
 ['August 9, 1942 (World Premiere-London)',
  'August 13, 1942 (Premiere-New York City)',
  'August 21, 1942 (U.S.)'],
 ['August 24, 1942 (World Premiere-Rio de Janeiro)',
  'February 6, 1943 (U.S. Premiere-Boston)',
  'February 19, 1943 (U.S.)'],
 ['July 17, 1943'],
 ['December 21, 1944 (Mexico City)', 'February 3, 1945 (US)'],
 ['April 20, 1946 (Premiere-New York City)', 'August 15, 1946 (U.S.)'],
 ['November 12, 1946 (Premiere: Atlanta, Georgia)', 'November 20, 1946'],
 ['September 27, 1947'],
 'May 27, 1948',
 ['November 29, 1948 (Chicago, Illinois)',
  'January 19, 1949 (Indianapolis, Indiana)'],
 ['October 5, 1949'],
 ['February 15, 1950 (Boston)', 'March 4, 1950 (Unite

In [25]:
from datetime import datetime

dates = [movie.get('Release date', 'N/A') for movie in movie_info_list]

def clean_date(date):
    return date.split("(")[0].strip()

def date_conversion(date):
    if isinstance(date, list):
        date = date[0]
    
    if date == "N/A":
        return None
    
    date_str = clean_date(date)
    
    fmts = ["%B %d, %Y", "%B %d %Y"]
    
    for fmt in fmts:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError: # the string doesn't match this format, try the next one
            pass
    return None


In [26]:
for movie in movie_info_list:
    movie['Release date (datetime)'] = date_conversion(movie.get('Release date', 'N/A'))
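
Some entries convert to `None` even though a release date is present, which means none of the known formats matched. A sketch for listing those entries so new formats can be added; the sample records are hypothetical, and the second one uses a day-first format that the parser does not know.

```python
from datetime import datetime


def clean_date(date):
    return date.split("(")[0].strip()


def date_conversion(date):
    if isinstance(date, list):
        date = date[0]
    if date == "N/A":
        return None
    date_str = clean_date(date)
    for fmt in ["%B %d, %Y", "%B %d %Y"]:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            pass
    return None


# Hypothetical records standing in for movie_info_list
sample_movies = [
    {"title": "Example A", "Release date": ["May 19, 1937"]},
    {"title": "Example B", "Release date": ["19 May 1937 (United Kingdom)"]},
    {"title": "Example C"},  # no release date at all, so nothing to report
]

# Keep only movies that have a raw date which failed to parse
unparsed = [
    (movie["title"], movie["Release date"])
    for movie in sample_movies
    if "Release date" in movie and date_conversion(movie["Release date"]) is None
]
print(unparsed)
```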

In [27]:
[movie['Release date (datetime)'] for movie in movie_info_list]

[datetime.datetime(1937, 5, 19, 0, 0),
 datetime.datetime(1937, 12, 21, 0, 0),
 datetime.datetime(1940, 2, 7, 0, 0),
 datetime.datetime(1940, 11, 13, 0, 0),
 datetime.datetime(1941, 6, 20, 0, 0),
 datetime.datetime(1941, 10, 23, 0, 0),
 datetime.datetime(1942, 8, 9, 0, 0),
 datetime.datetime(1942, 8, 24, 0, 0),
 datetime.datetime(1943, 7, 17, 0, 0),
 datetime.datetime(1944, 12, 21, 0, 0),
 datetime.datetime(1946, 4, 20, 0, 0),
 datetime.datetime(1946, 11, 12, 0, 0),
 datetime.datetime(1947, 9, 27, 0, 0),
 datetime.datetime(1948, 5, 27, 0, 0),
 datetime.datetime(1948, 11, 29, 0, 0),
 datetime.datetime(1949, 10, 5, 0, 0),
 datetime.datetime(1950, 2, 15, 0, 0),
 datetime.datetime(1950, 6, 22, 0, 0),
 datetime.datetime(1951, 7, 26, 0, 0),
 datetime.datetime(1952, 3, 13, 0, 0),
 datetime.datetime(1953, 2, 5, 0, 0),
 datetime.datetime(1953, 8, 8, 0, 0),
 datetime.datetime(1953, 11, 10, 0, 0),
 None,
 datetime.datetime(1954, 8, 17, 0, 0),
 datetime.datetime(1954, 12, 23, 0, 0),
 datetime.date

#### At this point we should save our data again. I will use pickle to do it.

In [28]:
import pickle

def save_data_pickle(name, data):
    with open(name, 'wb') as f:
        pickle.dump(data, f)

def load_data_pickle(name):
    with open(name, 'rb') as f:
        return pickle.load(f)
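
A likely reason for switching to pickle at this stage is that the records now hold `datetime` objects, which `json.dumps` cannot serialize by default, while pickle stores them as they are. A small sketch of that difference, using an in-memory round trip and a one-record sample:

```python
import json
import pickle
from datetime import datetime

record = {"title": "Peter Pan", "Release date (datetime)": datetime(1953, 2, 5)}

# json raises TypeError for datetime values unless a converter is supplied
try:
    json.dumps(record)
    json_accepts_datetime = True
except TypeError:
    json_accepts_datetime = False

# pickle round-trips the record, datetime included
restored = pickle.loads(pickle.dumps(record))

print(json_accepts_datetime)
print(restored == record)
```

Keep in mind that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.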

In [29]:
#save_data_pickle("disney_movie_data_cleaned_more.pickle", movie_info_list)

***
### Task #4: Attach IMDB/Rotten Tomatoes/Metascore scores
***

In [30]:
movie_info_list = load_data_pickle('dataset_checkpoints/disney_movie_data_cleaned_more.pickle')

In [31]:
# http://www.omdbapi.com/?apikey=[yourkey]&

In [32]:
import os
import urllib.parse

def get_omdb_info(title):
    base_url = "http://www.omdbapi.com/?"
    # The API key is read from the OMDB_API_KEY environment variable
    parameters = {"apikey": os.environ["OMDB_API_KEY"], 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()

def get_rotten_tomatoes_score(omdb_info):
    ratings = omdb_info.get('Ratings', [])
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None
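
Both helpers can be checked without calling the API. The sketch below builds the request URL with a dummy key and runs the rating lookup on a hand-written dictionary; `build_omdb_url` is a hypothetical helper, and the response only mimics the shape the code above expects, with placeholder values rather than real scores.

```python
from urllib.parse import urlencode


def build_omdb_url(title, api_key):
    """Build the OMDb query URL for a title (same parameters as get_omdb_info)."""
    return "http://www.omdbapi.com/?" + urlencode({"apikey": api_key, "t": title})


def get_rotten_tomatoes_score(omdb_info):
    for rating in omdb_info.get("Ratings", []):
        if rating["Source"] == "Rotten Tomatoes":
            return rating["Value"]
    return None


# Hypothetical response: placeholder values, only the structure matters
sample_response = {
    "Ratings": [
        {"Source": "Some Other Source", "Value": "placeholder"},
        {"Source": "Rotten Tomatoes", "Value": "placeholder%"},
    ],
}

print(build_omdb_url("Toy Story 3", "demo-key"))
print(get_rotten_tomatoes_score(sample_response))
print(get_rotten_tomatoes_score({}))  # no 'Ratings' key -> None
```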

In [33]:
# Careful when running this code, it may take a while to run
# for movie in movie_info_list:
#     title = movie['title']
#     omdb_info = get_omdb_info(title)
#     movie['imdb'] = omdb_info.get('imdbRating', None)
#     movie['metascore'] = omdb_info.get('Metascore', None)
#     movie['rotten_tomatoes'] = get_rotten_tomatoes_score(omdb_info)

In [34]:
movie_info_list[-50]

{'title': 'The Jungle Book',
 'Directed by': 'Jon Favreau',
 'Produced by': ['Jon Favreau', 'Brigham Taylor'],
 'Written by': 'Justin Marks',
 'Based on': ['The Jungle Book', 'by', 'Rudyard Kipling'],
 'Starring': ['Bill Murray',
  'Ben Kingsley',
  'Idris Elba',
  "Lupita Nyong'o",
  'Scarlett Johansson',
  'Giancarlo Esposito',
  'Christopher Walken',
  'Neel Sethi'],
 'Music by': 'John Debney',
 'Cinematography': 'Bill Pope',
 'Edited by': 'Mark Livolsi',
 'Production company': ['Walt Disney Pictures', 'Fairview Entertainment'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release date': ['April 4, 2016 ( El Capitan Theatre )',
  'April 15, 2016 (United States)'],
 'Running time': '106 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$175–177 million',
 'Box office': '$966.6 million',
 'Running time (int)': 106,
 'Budget (float)': 175000000.0,
 'Box office (float)': 966600000.0,
 'Release date (datetime)': datetime.datetime(2016, 4, 4, 0

In [35]:
#save_data_pickle('disney_movie_data_final.pickle', movie_info_list)

***
### Task #5: Save data as JSON & CSV
***

In [36]:
movie_info_copy = [movie.copy() for movie in movie_info_list]

In [37]:
for movie in movie_info_copy:
    current_date = movie['Release date (datetime)']
    if current_date:
        movie['Release date (datetime)'] = current_date.strftime("%B %d, %Y") 
    else:
        movie['Release date (datetime)'] = None
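
An alternative to formatting the dates by hand is to give `json.dumps` a `default` function that converts `datetime` objects as they are encountered. The sketch below assumes ISO 8601 strings are acceptable in the output; `encode_datetime` is a hypothetical helper. ISO strings can be parsed back with `datetime.fromisoformat`.

```python
import json
from datetime import datetime


def encode_datetime(obj):
    """Fallback for json.dumps: turn datetime objects into ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


record = {"title": "Peter Pan", "Release date (datetime)": datetime(1953, 2, 5)}

encoded = json.dumps(record, default=encode_datetime)
decoded = json.loads(encoded)
restored_date = datetime.fromisoformat(decoded["Release date (datetime)"])

print(decoded["Release date (datetime)"])
print(restored_date == record["Release date (datetime)"])
```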

In [38]:
movie_info_copy[20]

{'title': 'Peter Pan',
 'Directed by': ['Clyde Geronimi', 'Wilfred Jackson', 'Hamilton Luske'],
 'Produced by': 'Walt Disney',
 'Story by': ['Milt Banta',
  'Bill Cottrell',
  'Winston Hibler',
  'Bill Peet',
  'Erdman Penner',
  'Joe Rinaldi',
  'Ted Sears',
  'Ralph Wright'],
 'Based on': ['Peter and Wendy', 'by', 'J.M. Barrie'],
 'Starring': ['Bobby Driscoll',
  'Kathryn Beaumont',
  'Hans Conried',
  'Paul Collins',
  'Tommy Luske'],
 'Narrated by': 'Tom Conway',
 'Music by': 'Oliver Wallace',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'RKO Radio Pictures',
 'Release date': ['February 5, 1953 (United States)'],
 'Running time': '77 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$4 million',
 'Box office': '$87.4 million',
 'Running time (int)': 77,
 'Budget (float)': 4000000.0,
 'Box office (float)': 87400000.0,
 'Release date (datetime)': 'February 05, 1953'}

In [39]:
#save_data("disney_data_final.json", movie_info_copy)

#### Convert data to CSV

In [40]:
import pandas as pd

df = pd.DataFrame(movie_info_list)

In [41]:
df.head()

Unnamed: 0,title,Production company,Release date,Running time,Country,Language,Running time (int),Budget (float),Box office (float),Release date (datetime),...,Box office,Story by,Narrated by,Cinematography,Edited by,Screenplay by,Production companies,Adaptation by,Traditional,Simplified
0,Academy Award Review of,Walt Disney Productions,"[May 19, 1937]",41 minutes (74 minutes 1966 release),United States,English,41.0,,,1937-05-19,...,,,,,,,,,,
1,Snow White and the Seven Dwarfs,Walt Disney Productions,"[December 21, 1937 ( Carthay Circle Theatre , ...",83 minutes,United States,English,83.0,1490000.0,418000000.0,1937-12-21,...,$418 million,,,,,,,,,
2,Pinocchio,Walt Disney Productions,"[February 7, 1940 ( Center Theatre ), February...",88 minutes,United States,English,88.0,2600000.0,164000000.0,1940-02-07,...,$164 million,"[Ted Sears, Otto Englander, Webb Smith, Willia...",,,,,,,,
3,Fantasia,Walt Disney Productions,"[November 13, 1940]",126 minutes,United States,English,126.0,2280000.0,83300000.0,1940-11-13,...,$76.4–$83.3 million,"[Joe Grant, Dick Huemer]",Deems Taylor,James Wong Howe,,,,,,
4,The Reluctant Dragon,Walt Disney Productions,"[June 20, 1941]",74 minutes,United States,English,74.0,600000.0,960000.0,1941-06-20,...,"$960,000 (worldwide rentals)",,,Bert Giennon,Paul Weatherwax,,,,,


In [42]:
#df.to_csv("disney_movie_data_final.csv")

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 432 entries, 0 to 431
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   title                    432 non-null    object        
 1   Production company       392 non-null    object        
 2   Release date             431 non-null    object        
 3   Running time             422 non-null    object        
 4   Country                  428 non-null    object        
 5   Language                 430 non-null    object        
 6   Running time (int)       422 non-null    float64       
 7   Budget (float)           273 non-null    float64       
 8   Box office (float)       355 non-null    float64       
 9   Release date (datetime)  428 non-null    datetime64[ns]
 10  Directed by              431 non-null    object        
 11  Produced by              422 non-null    object        
 12  Written by               203 non-nul

In [44]:
df.describe()

|       | Running time (int) | Budget (float) | Box office (float) |
|-------|--------------------|----------------|--------------------|
| count | 422.0              | 273.0          | 355.0              |
| mean  | 97.305687          | 63588610.0     | 167401400.0        |
| std   | 18.959487          | 71637320.0     | 274951700.0        |
| min   | 40.0               | 150.0          | 7.7                |
| 25%   | 86.0               | 10000000.0     | 9850000.0          |
| 50%   | 96.0               | 30000000.0     | 42900000.0         |
| 75%   | 106.75             | 100000000.0    | 186550000.0        |
| max   | 168.0              | 410600000.0    | 1657000000.0       |


In [45]:
running_times = df.sort_values(['Running time (int)'], ascending=True).reset_index()
running_times.head()

Unnamed: 0,index,title,Production company,Release date,Running time,Country,Language,Running time (int),Budget (float),Box office (float),...,Box office,Story by,Narrated by,Cinematography,Edited by,Screenplay by,Production companies,Adaptation by,Traditional,Simplified
0,289,Roving Mars,"[Walt Disney Pictures, White Mountain Films, T...","[January 27, 2006]",40 minutes,United States,English,40.0,1000000.0,11000000.0,...,$11 million,,Paul Newman (introduction only),T.C. Christensen,Nancy Baker,,,,,
1,272,Sacred Planet,Walt Disney Pictures,"[April 22, 2004]",40 minutes,"[Canada, Malaysia, United States]",English,40.0,,1108356.0,...,"$1,108,356",,Robert Redford,William Reeve,Jon Long,,,,,
2,0,Academy Award Review of,Walt Disney Productions,"[May 19, 1937]",41 minutes (74 minutes 1966 release),United States,English,41.0,,,...,,,,,,,,,,
3,7,Saludos Amigos,Walt Disney Productions,"[August 24, 1942 (World Premiere-Rio de Janeir...",42 minutes,United States,"[English, Portuguese, Spanish]",42.0,,1135000.0,...,"$1,135,000 (worldwide rentals)","[Homer Brightman, William Cottrell, Richard Hu...",Fred Shields,,,,,,,
4,130,A Tale of Two Critters,Walt Disney Productions,"[June 22, 1977]",48 minutes,United States,English,48.0,,,...,,,Mayf Nutter,,G. Gregg McLaughlin,Jack Speirs,,,,
