## DISNEY MOVIES DATASET CREATION
Using Python and BeautifulSoup



In this notebook we scrape Wikipedia pages to create a dataset on Disney movies. 

The following steps are covered in this project:
- Web scraping with BeautifulSoup
- Cleaning data
- Pattern matching with regular expressions (re library)
- Working with dates (datetime library)
- Saving & loading data with the pickle library
- Accessing data from an API using the Requests library


I start with data collection, scraping multiple Wikipedia pages to build a Disney movie dataset.<br>
The scraped data is stored as JSON. I then load that loosely structured data, clean and preprocess it, and export it to CSV.<br>I also collected additional data for every title, such as the IMDb rating and Rotten Tomatoes score, from the OMDb API and attached it to the dataset.

Task #1: Scrape the infobox from the Toy Story 3 wiki page (save it in a Python dictionary) <br>
https://en.wikipedia.org/wiki/Toy_Story_3<br>
Task #2: Scrape the infobox for all movies in the List of Disney Films (save them as a list of dictionaries)<br>
https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films

## Task #1: Scrape the infobox from the Toy Story 3 wiki page (practice)


In [1]:
#### Import Necessary Libraries
from bs4 import BeautifulSoup as bs
import requests

#### Load the web page

In [2]:
r = requests.get("https://en.wikipedia.org/wiki/Toy_Story_3")

# Convert to a beautiful soup object
soup = bs(r.content)

# Print out the HTML
contents = soup.prettify()
#print(contents)

The data we want is in the table whose class is `infobox vevent`.

In [3]:
info_box = soup.find(class_="infobox vevent")
#print(info_box.prettify())

Let's get all the table rows in this infobox so that it's easy to go through them one by one.

In [4]:
info_rows = info_box.find_all("tr")
for row in info_rows:
    print(row.prettify())

<tr>
 <th class="infobox-above summary" colspan="2" style="font-size: 125%; font-style: italic;">
  Toy Story 3
 </th>
</tr>

<tr>
 <td class="infobox-image" colspan="2">
  <a class="image" href="/wiki/File:Toy_Story_3_poster.jpg" title="All of the toys packed close together, holding up a large numeral '3', with Buzz, who is putting a friendly arm around Woody's shoulder, and Woody holding the top of the 3.">
   <img alt="All of the toys packed close together, holding up a large numeral '3', with Buzz, who is putting a friendly arm around Woody's shoulder, and Woody holding the top of the 3." class="thumbborder" data-file-height="326" data-file-width="220" decoding="async" height="326" src="//upload.wikimedia.org/wikipedia/en/6/69/Toy_Story_3_poster.jpg" width="220"/>
  </a>
  <div class="infobox-caption">
   Theatrical release poster
  </div>
 </td>
</tr>

<tr>
 <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;">
  Directed by
 </th>
 <td class="

In [5]:
movie_info_ = {}
for index, row in enumerate(info_rows):
    if index == 0:
        movie_info_['title'] = row.find("th").get_text()
    elif index == 1:
        continue
    else:
        content_key = row.find("th").get_text()
        content_value = row.find("td").get_text()
        movie_info_[content_key] = content_value
        
print(movie_info_)

{'title': 'Toy Story 3', 'Directed by': 'Lee Unkrich', 'Screenplay by': 'Michael Arndt', 'Story by': '\nJohn Lasseter\nAndrew Stanton\nLee Unkrich\n', 'Produced by': 'Darla K. Anderson', 'Starring': '\nTom Hanks\nTim Allen\nJoan Cusack\nDon Rickles\nWallace Shawn\nJohn Ratzenberger\nEstelle Harris\nNed Beatty\nMichael Keaton\nJodi Benson\nJohn Morris\n', 'Cinematography': '\nJeremy Lasky\nKim White\n', 'Edited by': 'Ken Schretzmann', 'Music by': 'Randy Newman', 'Productioncompanies': '\nWalt Disney Pictures\nPixar Animation Studios\n', 'Distributed by': 'Walt Disney StudiosMotion Pictures', 'Release dates': '\nJune\xa012,\xa02010\xa0(2010-06-12) (Taormina Film Fest)\nJune\xa018,\xa02010\xa0(2010-06-18) (United States)\n', 'Running time': '103 minutes[1]', 'Country': 'United States', 'Language': 'English', 'Budget': '$200\xa0million[1]', 'Box office': '$1.067\xa0billion[1]'}


In the output above, fields that hold several names come back as one newline-separated string.<br>
When a cell contains list items, we instead return its values as a Python list.

In [6]:
def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

movie_info = {}
for index, row in enumerate(info_rows):
    if index == 0:
        movie_info['title'] = row.find("th").get_text(" ", strip=True)
    elif index == 1:
        continue
    else:
        content_key = row.find("th").get_text(" ", strip=True)
        content_value = get_content_value(row.find("td"))
        movie_info[content_key] = content_value
    
movie_info

{'title': 'Toy Story 3',
 'Directed by': 'Lee Unkrich',
 'Screenplay by': 'Michael Arndt',
 'Story by': ['John Lasseter', 'Andrew Stanton', 'Lee Unkrich'],
 'Produced by': 'Darla K. Anderson',
 'Starring': ['Tom Hanks',
  'Tim Allen',
  'Joan Cusack',
  'Don Rickles',
  'Wallace Shawn',
  'John Ratzenberger',
  'Estelle Harris',
  'Ned Beatty',
  'Michael Keaton',
  'Jodi Benson',
  'John Morris'],
 'Cinematography': ['Jeremy Lasky', 'Kim White'],
 'Edited by': 'Ken Schretzmann',
 'Music by': 'Randy Newman',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release dates': ['June 12, 2010 ( 2010-06-12 ) ( Taormina Film Fest )',
  'June 18, 2010 ( 2010-06-18 ) (United States)'],
 'Running time': '103 minutes [1]',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$200 million [1]',
 'Box office': '$1.067 billion [1]'}

## Task #2: Get info box for all movies

In [7]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")

# Convert to a beautiful soup object
soup = bs(r.content)

# Print out the HTML
contents = soup.prettify()
#print(contents)

In [8]:
movies = soup.select(".wikitable.sortable i")
movies[0:10]

[<i><a href="/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons" title="Academy Award Review of Walt Disney Cartoons">Academy Award Review of Walt Disney Cartoons</a></i>,
 <i><a href="/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)" title="Snow White and the Seven Dwarfs (1937 film)">Snow White and the Seven Dwarfs</a></i>,
 <i><a href="/wiki/Pinocchio_(1940_film)" title="Pinocchio (1940 film)">Pinocchio</a></i>,
 <i><a href="/wiki/Fantasia_(1940_film)" title="Fantasia (1940 film)">Fantasia</a></i>,
 <i><a href="/wiki/The_Reluctant_Dragon_(1941_film)" title="The Reluctant Dragon (1941 film)">The Reluctant Dragon</a></i>,
 <i><a href="/wiki/Dumbo" title="Dumbo">Dumbo</a></i>,
 <i><a href="/wiki/Bambi" title="Bambi">Bambi</a></i>,
 <i><a href="/wiki/Saludos_Amigos" title="Saludos Amigos">Saludos Amigos</a></i>,
 <i><a href="/wiki/Victory_Through_Air_Power_(film)" title="Victory Through Air Power (film)">Victory Through Air Power</a></i>,
 <i><a href="/wiki/The_Three_Caballeros" title=

In [9]:
movies[0].a['href']

'/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons'

In [10]:
movies[0].a['title']

'Academy Award Review of Walt Disney Cartoons'

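`tag.decompose()` removes a tag and everything inside it from the parsed tree.<br>
We use it to strip the `sup` (reference) tags and the `span` tags, such as the hidden `span` in the release dates.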
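
As a small illustration of what `decompose()` does, the sketch below runs it on a hand-written table cell rather than a live page. The HTML string is a made-up stand-in for a Wikipedia release-date cell, and `html.parser` is chosen only so that the example needs no extra parser package.

```python
from bs4 import BeautifulSoup

# Hypothetical cell, loosely modelled on a Wikipedia release-date row
html = (
    "<td>June 18, 2010"
    '<span style="display:none"> (2010-06-18)</span>'
    "<sup>[1]</sup></td>"
)
cell = BeautifulSoup(html, "html.parser").td

before = cell.get_text(" ", strip=True)

# decompose() removes each matched tag, and everything inside it, from the tree
for tag in cell.find_all(["sup", "span"]):
    tag.decompose()

after = cell.get_text(" ", strip=True)
print(before)
print(after)
```

After the loop only the visible date text is left in the cell, which is what the date parsing later in the notebook relies on.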

In [11]:
def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    elif row_data.find("br"):
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

# Remove sup (reference) and span tags from the parsed page, in place
def clean_tags(soup):
    for tag in soup.find_all(["sup", "span"]):
        tag.decompose()
        
def get_info_box(url):

    r = requests.get(url)
    soup = bs(r.content)
    info_box = soup.find(class_="infobox vevent")
    info_rows = info_box.find_all("tr")
    
    clean_tags(soup)

    movie_info = {}
    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info['title'] = row.find("th").get_text(" ", strip=True)
        else:
            # Only rows that have a header cell (th) hold a key/value pair
            header = row.find("th")
            if header:
                content_key = header.get_text(" ", strip=True)
                content_value = get_content_value(row.find("td"))
                movie_info[content_key] = content_value

    return movie_info

In [12]:
get_info_box("https://en.wikipedia.org/wiki/One_Little_Indian_(film)")

{'title': 'One Little Indian',
 'Directed by': 'Bernard McEveety',
 'Written by': 'Harry Spalding',
 'Produced by': 'Winston Hibler',
 'Starring': ['James Garner',
  'Vera Miles',
  'Pat Hingle',
  'Morgan Woodward',
  'Jodie Foster'],
 'Cinematography': 'Charles F. Wheeler',
 'Edited by': 'Robert Stafford',
 'Music by': 'Jerry Goldsmith',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Distribution',
 'Release date': ['June 20, 1973'],
 'Running time': '90 Minutes',
 'Country': 'United States',
 'Language': 'English',
 'Box office': '$2 million'}

In [13]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")
soup = bs(r.content)
movies = soup.select(".wikitable.sortable i a")

# No trailing slash: each href already starts with "/wiki/..."
base_path = "https://en.wikipedia.org"

movie_info_list = []
for index, movie in enumerate(movies):
    if index % 10 == 0:
        print(index)
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path
        title = movie['title']
        
        movie_info_list.append(get_info_box(full_path))
        
    except Exception as e:
        print(movie.get_text())
        print(e)

0
10
20
30
40
Zorro the Avenger
'NoneType' object has no attribute 'find'
The Sign of Zorro
'NoneType' object has no attribute 'find'
50
60
70
80
90
100
110
120
130
140
The London Connection
'NoneType' object has no attribute 'find'
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
The Beatles: Get Back – The Rooftop Concert
'NoneType' object has no attribute 'find'
490
500
61
'NoneType' object has no attribute 'find_all'
All Night Long
'NoneType' object has no attribute 'find'
510
Keeper of the Lost Cities
'NoneType' object has no attribute 'find_all'
Muppet Man
'NoneType' object has no attribute 'find_all'
520
Sister Act 3
'NoneType' object has no attribute 'find'
The Thief
'NoneType' object has no attribute 'find_all'
Tom Sawyer
'NoneType' object has no attribute 'find_all'
530
Tower of Terror
'NoneType' object has no attribute 'find_all'
Tron: Ares
'NoneType' object has no attribute 'find'
FC Barc

In [14]:
len(movie_info_list)

519
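
Most of the failures in the log above look like pages without an `infobox vevent` table (the `find_all` errors) or infobox rows that have a `th` but no `td` (the `find` errors). A minimal sketch of a more defensive parser follows. It works on an HTML string so it can be tried offline; the name `parse_info_box` and the sample markup are illustrative assumptions, and cell values are kept as plain text to keep the example short.

```python
from bs4 import BeautifulSoup

def parse_info_box(html):
    """Return the infobox as a dict, or None when the page has no infobox."""
    soup = BeautifulSoup(html, "html.parser")
    info_box = soup.find(class_="infobox vevent")
    if info_box is None:
        return None

    movie_info = {}
    for index, row in enumerate(info_box.find_all("tr")):
        header, cell = row.find("th"), row.find("td")
        if index == 0 and header:
            movie_info["title"] = header.get_text(" ", strip=True)
        elif header and cell:
            # Skip section rows that have a header but no value cell
            movie_info[header.get_text(" ", strip=True)] = cell.get_text(" ", strip=True)
    return movie_info

# Made-up markup: a title row, a header-only row, and a normal key/value row
sample = (
    '<table class="infobox vevent">'
    "<tr><th>Demo Film</th></tr>"
    "<tr><th>Section header only</th></tr>"
    "<tr><th>Directed by</th><td>Jane Doe</td></tr>"
    "</table>"
)
parsed = parse_info_box(sample)
missing = parse_info_box("<p>No infobox here</p>")
print(parsed)
print(missing)
```

With checks like these the scraping loop could skip or log such pages explicitly instead of relying on a broad `except Exception`.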

### Save/Reload Movie Data
Save all the dictionaries to a JSON file.

In [15]:
import json

def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

In [16]:
import json

def load_data(title):
    with open(title, encoding="utf-8") as f:
        return json.load(f)

In [17]:
save_data("disney_movies_data_cleaned.json", movie_info_list)

## Task #3: Clean our data!

In [18]:
movie_info_list = load_data("disney_movies_data_cleaned.json")

#### Subtasks
- Clean up references [1]
- Convert running time into an integer
- Convert dates into datetime object
- Split up the long strings
- Convert Budget & Box office to numbers

In [19]:
movie_info_list[-40]

{'title': 'Jungle Cruise',
 'Directed by': 'Jaume Collet-Serra',
 'Screenplay by': ['Michael Green', 'Glenn Ficarra', 'John Requa'],
 'Story by': ['John Norville',
  'Josh Goldstein',
  'Glenn Ficarra',
  'John Requa'],
 'Based on': "Walt Disney 's Jungle Cruise",
 'Produced by': ['John Davis',
  'John Fox',
  'Beau Flynn',
  'Dwayne Johnson',
  'Dany Garcia',
  'Hiram Garcia'],
 'Starring': ['Dwayne Johnson',
  'Emily Blunt',
  'Édgar Ramírez',
  'Jack Whitehall',
  'Jesse Plemons',
  'Paul Giamatti'],
 'Cinematography': 'Flavio Labiano',
 'Edited by': 'Joel Negron',
 'Music by': 'James Newton Howard',
 'Production companies': ['Walt Disney Pictures',
  'Davis Entertainment',
  'Seven Bucks Productions',
  'Flynn Picture Company'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release dates': ['July 24, 2021 ( Disneyland Resort )',
  'July 30, 2021 (United States)'],
 'Running time': '128 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$200 mil

In [20]:
print([movie.get('Running time', 'N/A') for movie in movie_info_list])

['41 minutes (74 minutes 1966 release)', '83 minutes', '88 minutes', '126 minutes', '74 minutes', '64 minutes', '70 minutes', '42 minutes', '70 min', '71 minutes', '75 minutes', '94 minutes', '73 minutes', '75 minutes', '82 minutes', '68 minutes', '74 minutes', '96 minutes', '75 minutes', '84 minutes', '77 minutes', '92 minutes', '69 minutes', '81 minutes', ['60 minutes (VHS version)', '71 minutes (original)'], '127 minutes', '192 minutes', '76 minutes', '75 minutes', '73 minutes', '85 minutes', '81 minutes', '70 minutes', '90 min.', '80 minutes', '75 minutes', '83 minutes', '83 minutes', '72 minutes', '97 minutes', '75 minutes', '104 minutes', '93 minutes', '105 minutes', '95 minutes', '97 minutes', '134 minutes', '69 minutes', '92 minutes', '126 minutes', '79 minutes', '97 minutes', '128 minutes', '73 minutes', '91 minutes', '105 minutes', '98 minutes', '130 minutes', '89 minutes', '93 minutes', '67 minutes', '98 minutes', '100 minutes', '118 minutes', '103 minutes', '110 minutes', '

In [21]:
# "85 minutes", '41 minutes (74 minutes 1966 release)','N/A', ['468 minutes'], '70 min'
def minutes_to_integer(running_time):
    if running_time == "N/A":
        return None
    
    if isinstance(running_time, list):
        return int(running_time[0].split(" ")[0])
    else: # is a string
        return int(running_time.split(" ")[0])

for movie in movie_info_list:
    movie['Running time (int)'] = minutes_to_integer(movie.get('Running time', "N/A"))

In [22]:
print([movie.get('Running time (int)', 'N/A') for movie in movie_info_list])

[41, 83, 88, 126, 74, 64, 70, 42, 70, 71, 75, 94, 73, 75, 82, 68, 74, 96, 75, 84, 77, 92, 69, 81, 60, 127, 192, 76, 75, 73, 85, 81, 70, 90, 80, 75, 83, 83, 72, 97, 75, 104, 93, 105, 95, 97, 134, 69, 92, 126, 79, 97, 128, 73, 91, 105, 98, 130, 89, 93, 67, 98, 100, 118, 103, 110, 80, 74, 91, 91, 97, 118, 139, 131, 92, 87, 116, 93, 110, 110, 131, 101, 108, 84, 78, 75, 164, 106, 110, 99, 113, 108, 112, 93, 91, 93, 100, 100, 79, 96, 113, 89, 117, 92, 88, 92, 87, 93, 93, 93, 90, 83, 96, 88, 89, 91, 93, 92, 97, 100, 100, 89, None, 91, 112, 115, 95, 91, 97, 104, 74, 48, 77, 104, 128, 101, 94, 104, 90, 100, 88, 93, 98, 112, 84, 97, 97, 114, 96, 97, 109, 83, 90, 107, 96, 103, 91, 95, 105, 113, 80, 101, 90, 74, 90, 89, 110, 74, 93, 84, 83, 74, 77, 107, 93, 88, 108, 84, 121, 89, 104, 90, 86, 84, 108, 107, 96, 98, 105, 108, 94, 106, 102, 88, 102, 102, 97, 111, 100, 96, 98, 78, 81, 108, 89, 99, 89, 81, 92, 100, 89, 79, 91, 101, 104, 103, 86, 105, 75, 93, 92, 98, 95, 93, 87, 93, 87, 128, 77, 86, 95, 
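
The split-based conversion above assumes that the string starts with the number. A slightly more tolerant variant, sketched here with a regular expression, takes the first run of digits wherever it appears. The function name `running_time_minutes` is hypothetical, and like the original it keeps only the first value when several cuts are listed.

```python
import re

def running_time_minutes(running_time):
    """Return the first integer found in a running-time value, or None."""
    if isinstance(running_time, list):
        running_time = running_time[0] if running_time else None
    if running_time is None or running_time == "N/A":
        return None
    match = re.search(r"\d+", running_time)
    return int(match.group()) if match else None

samples = [
    "85 minutes",
    "41 minutes (74 minutes 1966 release)",
    ["60 minutes (VHS version)", "71 minutes (original)"],
    "90 min.",
    "approx. 70 min",  # made-up input that the split-based version cannot parse
    "N/A",
]
converted = [running_time_minutes(s) for s in samples]
print(converted)
```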

In [23]:
print([movie.get('Budget', 'N/A') for movie in movie_info_list])

['N/A', '$1.49 million', '$2.6 million', '$2.28 million', '$600,000', '$950,000', '$858,000', 'N/A', '$788,000', 'N/A', '$1.35 million', '$2.125 million', 'N/A', '$1.5 million', '$1.5 million', 'N/A', '$2.2 million', '$1,800,000', '$3 million', 'N/A', '$4 million', '$2 million', '$300,000', '$1.8 million', 'N/A', '$5 million', 'N/A', '$4 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$700,000', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$6 million', 'under $1 million or $1,250,000', 'N/A', '$2 million', 'N/A', 'N/A', '$2.5 million', 'N/A', 'N/A', '$4 million', '$3.6 million', 'N/A', 'N/A', 'N/A', 'N/A', '$3 million', 'N/A', '$3 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$3 million', 'N/A', 'N/A', 'N/A', 'N/A', '$4.4–6 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$4 million', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$4 million', 'N/A', 'N/A', 'N/A', '

In [24]:
import re

amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"

word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
value_re = rf"\${number}"

def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

def parse_word_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value

def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    return value

def money_conversion(money):
    """
    money_conversion("$12.2 million") --> 12200000 ## Word syntax
    money_conversion("$790,000") --> 790000        ## Value syntax
    """
    if money == "N/A":
        return None

    if isinstance(money, list):
        money = money[0]
        
    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)

    if word_syntax:
        return parse_word_syntax(word_syntax.group())

    elif value_syntax:
        return parse_value_syntax(value_syntax.group())

    else:
        return None
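
To make the two syntaxes concrete, here is a condensed, self-contained variant of the parser with a few spot checks. It is a sketch rather than a drop-in replacement: the names `money_to_float`, `WORD_RE`, and `VALUE_RE` are illustrative, and unlike `word_re` above it also accepts a second `$` inside a range (as in `$76.4–$83.3 million`) and always keeps the lower bound.

```python
import re

AMOUNTS = r"thousand|million|billion"
NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"
# "$4.4-6 million", "$76.4–$83.3 million", "$1 to 2 million", "$12.2 million"
WORD_RE = rf"\$({NUMBER})(?:\s?(?:-|–|to)\s?\$?{NUMBER})?\s({AMOUNTS})"
VALUE_RE = rf"\$({NUMBER})"
SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

def money_to_float(money):
    """Convert a budget or box-office string to a float, or None if unparseable."""
    if isinstance(money, list):
        money = money[0] if money else None
    if not money or money == "N/A":
        return None
    word = re.search(WORD_RE, money, flags=re.I)
    if word:  # word syntax: number followed by thousand/million/billion
        return float(word.group(1).replace(",", "")) * SCALE[word.group(2).lower()]
    value = re.search(VALUE_RE, money)
    if value:  # value syntax: plain dollar amount
        return float(value.group(1).replace(",", ""))
    return None

for text in ["$200 million", "$790,000", "$76.4–$83.3 million", "83 crore", "N/A"]:
    print(text, "->", money_to_float(text))
```

Strings without a dollar amount, such as `83 crore`, come back as `None`, so non-USD figures are dropped rather than misread.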

In [25]:
for movie in movie_info_list:
    movie['Budget (float)'] = money_conversion(movie.get('Budget', "N/A"))
    movie['Box office (float)'] = money_conversion(movie.get('Box office', "N/A"))

In [26]:
movie_info_list[-40]

{'title': 'Jungle Cruise',
 'Directed by': 'Jaume Collet-Serra',
 'Screenplay by': ['Michael Green', 'Glenn Ficarra', 'John Requa'],
 'Story by': ['John Norville',
  'Josh Goldstein',
  'Glenn Ficarra',
  'John Requa'],
 'Based on': "Walt Disney 's Jungle Cruise",
 'Produced by': ['John Davis',
  'John Fox',
  'Beau Flynn',
  'Dwayne Johnson',
  'Dany Garcia',
  'Hiram Garcia'],
 'Starring': ['Dwayne Johnson',
  'Emily Blunt',
  'Édgar Ramírez',
  'Jack Whitehall',
  'Jesse Plemons',
  'Paul Giamatti'],
 'Cinematography': 'Flavio Labiano',
 'Edited by': 'Joel Negron',
 'Music by': 'James Newton Howard',
 'Production companies': ['Walt Disney Pictures',
  'Davis Entertainment',
  'Seven Bucks Productions',
  'Flynn Picture Company'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release dates': ['July 24, 2021 ( Disneyland Resort )',
  'July 30, 2021 (United States)'],
 'Running time': '128 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$200 mil

In [27]:
money_conversion(str(movie_info_list[-40]["Budget"]))

200000000.0

In [28]:
# Convert Dates into datetimes
print([movie.get('Release date', 'N/A') for movie in movie_info_list])

[['May 19, 1937'], 'N/A', 'N/A', ['November 13, 1940'], ['June 27, 1941'], 'N/A', 'N/A', 'N/A', ['July 17, 1943'], 'N/A', 'N/A', 'N/A', ['September 27, 1947'], 'May 27, 1948', 'N/A', ['October 5, 1949'], 'N/A', 'N/A', 'N/A', 'N/A', ['February 5, 1953 (United States)'], ['July 23, 1953 (US)'], ['November 10, 1953'], 'N/A', ['August 17, 1954'], ['December 23, 1954'], 'May 25, 1955', ['June 22, 1955'], ['September 14, 1955'], 'December 22, 1955', 'June 8, 1956', 'July 18, 1956', ['September 4, 1956'], ['December 20, 1956'], 'June 19, 1957', 'August 28, 1957', ['December 25, 1957'], ['July 8, 1958'], ['August 12, 1958'], ['December 25, 1958'], ['January 29, 1959'], ['March 19, 1959'], 'N/A', ['November 10, 1959'], 'January 21, 1960 ( Sarasota, FL )', ['February 24, 1960'], 'May 19, 1960', 'N/A', ['November 1, 1960'], ['December 21, 1960'], ['January 25, 1961'], 'March 16, 1961', ['June 21, 1961'], ['July 12, 1961'], ['July 17, 1961'], ['December 14, 1961'], 'April 5, 1962', ['May 17, 1962'

In [29]:
movie_info_list[-50]

{'title': 'The One and Only Ivan',
 'Directed by': 'Thea Sharrock',
 'Screenplay by': 'Mike White',
 'Based on': ['The One and Only Ivan', 'by', 'K. A. Applegate'],
 'Produced by': ['Angelina Jolie', 'Allison Shearmur', 'Brigham Taylor'],
 'Starring': ['Sam Rockwell',
  'Angelina Jolie',
  'Danny DeVito',
  'Helen Mirren',
  'Ramón Rodríguez',
  'Ariana Greenblatt',
  'Bryan Cranston'],
 'Cinematography': 'Florian Ballhaus',
 'Edited by': 'Barney Pilling',
 'Music by': 'Craig Armstrong',
 'Production companies': ['Walt Disney Pictures', 'Jolie Pas Productions'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release date': ['August 21, 2020 (United States)'],
 'Running time': '95 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Running time (int)': 95,
 'Budget (float)': None,
 'Box office (float)': None}

In [30]:
# June 28, 1950
from datetime import datetime

dates = [movie.get('Release date', 'N/A') for movie in movie_info_list]

def clean_date(date):
    return date.split("(")[0].strip()

def date_conversion(date):
    if isinstance(date, list):
        date = date[0]
        
    if date == "N/A":
        return None
        
    date_str = clean_date(date)

    fmts = ["%B %d, %Y", "%d %B %Y"]
    for fmt in fmts:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:  # format did not match, try the next one
            pass
    return None


In [31]:
for movie in movie_info_list:
    movie['Release date (datetime)'] = date_conversion(movie.get('Release date', 'N/A'))
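
Note that some infoboxes use the key `Release dates` (plural), as in the Toy Story 3 and Jungle Cruise records, so a lookup on `Release date` alone returns `N/A` for them. The sketch below shows one possible way to check both keys; `release_date` is a hypothetical helper, and the conversion logic is a condensed restatement of the cell above.

```python
from datetime import datetime

def clean_date(date):
    # Drop any trailing "(...)" note such as "(United States)"
    return date.split("(")[0].strip()

def date_conversion(date):
    if isinstance(date, list):
        date = date[0] if date else None
    if date is None or date == "N/A":
        return None
    date_str = clean_date(date)
    for fmt in ("%B %d, %Y", "%d %B %Y"):
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return None

def release_date(movie):
    """Look under 'Release date' first, then fall back to 'Release dates'."""
    raw = movie.get("Release date", movie.get("Release dates", "N/A"))
    return date_conversion(raw)

plural = {"Release dates": ["July 24, 2021 ( Disneyland Resort )",
                            "July 30, 2021 (United States)"]}
singular = {"Release date": "14 July 2017"}
print(release_date(plural), release_date(singular), release_date({}))
```

As in the original, only the first listed date is kept when a film has several release dates.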

In [32]:
movie_info_list[50]

{'title': '101 Dalmatians',
 'Directed by': ['Clyde Geronimi', 'Hamilton Luske', 'Wolfgang Reitherman'],
 'Story by': 'Bill Peet',
 'Based on': ['The Hundred and One Dalmatians', 'by', 'Dodie Smith'],
 'Produced by': 'Walt Disney',
 'Starring': ['Rod Taylor',
  'Cate Bauer',
  'Betty Lou Gerson',
  'Ben Wright',
  'Bill Lee (singing voice)',
  'Lisa Davis',
  'Martha Wentworth'],
 'Edited by': ['Roy M. Brewer, Jr.', 'Donald Halliday'],
 'Music by': 'George Bruns',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Distribution',
 'Release date': ['January 25, 1961'],
 'Running time': '79 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$3.6 million',
 'Box office': '$303 million',
 'Running time (int)': 79,
 'Budget (float)': 3600000.0,
 'Box office (float)': 303000000.0,
 'Release date (datetime)': datetime.datetime(1961, 1, 25, 0, 0)}

In [36]:
import pickle

def save_data_pickle(name, data):
    with open(name, 'wb') as f:
        pickle.dump(data, f)

In [37]:
def load_data_pickle(name):
    with open(name, 'rb') as f:
        return pickle.load(f)

In [38]:
save_data_pickle("disney_movie_data_cleaned_more.pickle", movie_info_list)

In [39]:
a = load_data_pickle("disney_movie_data_cleaned_more.pickle")

In [40]:
a == movie_info_list

True
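
Pickle is convenient here because the records now contain `datetime` objects, which the `json` module cannot serialize by default. The sketch below demonstrates that failure on a one-record sample and shows one alternative, storing the date as an ISO 8601 string through the `default` hook. The helper name `encode_datetime` is an assumption for illustration.

```python
import json
from datetime import datetime

record = {"title": "101 Dalmatians",
          "Release date (datetime)": datetime(1961, 1, 25)}

# Plain json.dumps does not know how to handle datetime objects
try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False

def encode_datetime(obj):
    """Fallback encoder: called by json only for objects it cannot serialize."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

encoded = json.dumps(record, default=encode_datetime)
decoded = json.loads(encoded)
restored = datetime.fromisoformat(decoded["Release date (datetime)"])
print(json_ok, encoded, restored)
```

Either route works for this dataset. Pickle keeps the Python types intact, but pickle files should only be loaded when they come from a trusted source.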

## Task #4: Attach IMDB/Rotten Tomatoes/Metascore scores

In [41]:
movie_info_list = load_data_pickle('disney_movie_data_cleaned_more.pickle')

In [43]:
movie_info_list[-50]

{'title': 'The One and Only Ivan',
 'Directed by': 'Thea Sharrock',
 'Screenplay by': 'Mike White',
 'Based on': ['The One and Only Ivan', 'by', 'K. A. Applegate'],
 'Produced by': ['Angelina Jolie', 'Allison Shearmur', 'Brigham Taylor'],
 'Starring': ['Sam Rockwell',
  'Angelina Jolie',
  'Danny DeVito',
  'Helen Mirren',
  'Ramón Rodríguez',
  'Ariana Greenblatt',
  'Bryan Cranston'],
 'Cinematography': 'Florian Ballhaus',
 'Edited by': 'Barney Pilling',
 'Music by': 'Craig Armstrong',
 'Production companies': ['Walt Disney Pictures', 'Jolie Pas Productions'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release date': ['August 21, 2020 (United States)'],
 'Running time': '95 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Running time (int)': 95,
 'Budget (float)': None,
 'Box office (float)': None,
 'Release date (datetime)': datetime.datetime(2020, 8, 21, 0, 0)}

### OMDb API
The Open Movie Database<br>

The OMDb API is a RESTful web service for retrieving movie information. All content and images on the site are contributed and maintained by its users.<br>

https://www.omdbapi.com/

In [48]:
# http://www.omdbapi.com/?apikey=[yourkey]&

In [47]:
import requests
import urllib.parse  # import the submodule explicitly; plain "import urllib" does not guarantee it is loaded
import os

def get_omdb_info(title):
    base_url = "https://www.omdbapi.com/?"  # HTTPS so the API key is not sent in plain text
    parameters = {"apikey": os.environ['OMDB_API_KEY'], 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()

def get_rotten_tomato_score(omdb_info):
    ratings = omdb_info.get('Ratings', [])
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None

get_omdb_info("The One and Only Ivan")

{'Title': 'The One and Only Ivan',
 'Year': '2020',
 'Rated': 'PG',
 'Released': '21 Aug 2020',
 'Runtime': '95 min',
 'Genre': 'Adventure, Comedy, Drama',
 'Director': 'Thea Sharrock',
 'Writer': 'Mike White, Katherine Applegate',
 'Actors': 'Sam Rockwell, Bryan Cranston, Phillipa Soo',
 'Plot': 'A gorilla named Ivan tries to piece together his past with the help of an elephant named Ruby as they hatch a plan to escape from captivity.',
 'Language': 'English',
 'Country': 'United States',
 'Awards': 'Nominated for 1 Oscar. 2 wins & 4 nominations total',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BZWY3OTNhNWUtMDk2My00ZGVhLWE5ODQtM2NkOTZiMWM2MGY2XkEyXkFqcGdeQXVyNjMwMzc3MjE@._V1_SX300.jpg',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '6.6/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '71%'},
  {'Source': 'Metacritic', 'Value': '58/100'}],
 'Metascore': '58',
 'imdbRating': '6.6',
 'imdbVotes': '11,829',
 'imdbID': 'tt3661394',
 'Type': 'movie',
 'DVD': '21 Aug 20
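
Two details of the lookup are worth a quick sketch: `urlencode` escapes titles that contain spaces or `&`, and a title that OMDb cannot match still returns JSON. The example below builds the URL without sending a request. It assumes that a failed lookup carries `'Response': 'False'` along with an `'Error'` message, which should be checked against the OMDb documentation; `build_omdb_url`, `found`, and the key `demo-key` are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_omdb_url(title, api_key):
    """Return the OMDb query URL for a title (no request is sent)."""
    query = urlencode({"apikey": api_key, "t": title})
    return "https://www.omdbapi.com/?" + query

def found(omdb_info):
    """Assumed convention: OMDb marks failed lookups with Response == 'False'."""
    return omdb_info.get("Response") != "False"

url = build_omdb_url("Lilo & Stitch", "demo-key")  # placeholder key
print(url)

# Round-trip the query string to confirm the title survives encoding
params = parse_qs(urlsplit(url).query)
print(params["t"])

print(found({"Response": "False", "Error": "Movie not found!"}))
```

A check like `found` lets the enrichment loop tell "no match" apart from "matched, but no rating available".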

In [49]:
for movie in movie_info_list:
    title = movie['title']
    omdb_info = get_omdb_info(title)
    movie['imdb'] = omdb_info.get('imdbRating', None)
    movie['metascore'] = omdb_info.get('Metascore', None)
    movie['rotten_tomatoes'] = get_rotten_tomato_score(omdb_info)

In [51]:
movie_info_list[-50]

{'title': 'The One and Only Ivan',
 'Directed by': 'Thea Sharrock',
 'Screenplay by': 'Mike White',
 'Based on': ['The One and Only Ivan', 'by', 'K. A. Applegate'],
 'Produced by': ['Angelina Jolie', 'Allison Shearmur', 'Brigham Taylor'],
 'Starring': ['Sam Rockwell',
  'Angelina Jolie',
  'Danny DeVito',
  'Helen Mirren',
  'Ramón Rodríguez',
  'Ariana Greenblatt',
  'Bryan Cranston'],
 'Cinematography': 'Florian Ballhaus',
 'Edited by': 'Barney Pilling',
 'Music by': 'Craig Armstrong',
 'Production companies': ['Walt Disney Pictures', 'Jolie Pas Productions'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release date': ['August 21, 2020 (United States)'],
 'Running time': '95 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Running time (int)': 95,
 'Budget (float)': None,
 'Box office (float)': None,
 'Release date (datetime)': datetime.datetime(2020, 8, 21, 0, 0),
 'imdb': '6.6',
 'metascore': '58',
 'rotten_tomatoes': '71%'}
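
The three scores are stored as strings (`'6.6'`, `'58'`, `'71%'`), and OMDb uses `'N/A'` when a score is missing. If numeric columns are wanted for analysis, a small converter along these lines could be applied; `score_to_float` is a hypothetical helper and not part of the notebook.

```python
def score_to_float(value):
    """Convert '6.6', '58', '71%' or '58/100' to a float; 'N/A' and None become None."""
    if value is None or value == "N/A":
        return None
    # Keep the part before any '/', then drop a trailing percent sign
    cleaned = value.split("/")[0].strip().rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None

sample = {"imdb": "6.6", "metascore": "N/A", "rotten_tomatoes": "71%"}
scores = {key: score_to_float(val) for key, val in sample.items()}
print(scores)
```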

In [52]:
save_data_pickle('disney_movie_data_final.pickle', movie_info_list)

## Task #5: Save data as JSON & CSV

In [54]:
movie_info_list[100]

{'title': 'Scandalous John',
 'Directed by': 'Robert Butler',
 'Written by': 'Bill Walsh',
 'Produced by': 'Bill Walsh',
 'Starring': ['Brian Keith', 'Alfonso Arau', 'Michele Carey'],
 'Cinematography': 'Frank V. Phillips',
 'Edited by': 'Cotton Warburton',
 'Music by': 'Rod McKuen',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Distribution',
 'Release date': 'June 22, 1971',
 'Running time': '113 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Running time (int)': 113,
 'Budget (float)': None,
 'Box office (float)': None,
 'Release date (datetime)': datetime.datetime(1971, 6, 22, 0, 0),
 'imdb': '5.8',
 'metascore': 'N/A',
 'rotten_tomatoes': '20%'}

In [55]:
movie_info_copy = [movie.copy() for movie in movie_info_list]

In [56]:
for movie in movie_info_copy:
    current_date = movie['Release date (datetime)']
    if current_date:
        movie['Release date (datetime)'] = current_date.strftime("%B %d, %Y")
    else:
        movie['Release date (datetime)'] = None

In [57]:
save_data("disney_data_final.json", movie_info_copy)

### Convert data to CSV

In [60]:
import pandas as pd
disney_movie_data = pd.DataFrame(movie_info_list)

In [61]:
disney_movie_data.head()

| | title | Production company | Distributed by | Release date | Running time | Country | Language | Box office | Running time (int) | Budget (float) | ... | Original concept by | Created by | Original work | Owner | Music | Lyrics | Book | Basis | Productions | Awards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Academy Award Review of | Walt Disney Productions | United Artists | [May 19, 1937] | 41 minutes (74 minutes 1966 release) | United States | English | $45.472 | 41.0 | | ... | | | | | | | | | | |
| 1 | Snow White and the Seven Dwarfs | Walt Disney Productions | RKO Radio Pictures | | 83 minutes | United States | English | $418 million | 83.0 | 1490000.0 | ... | | | | | | | | | | |
| 2 | Pinocchio | Walt Disney Productions | RKO Radio Pictures | | 88 minutes | United States | English | $164 million | 88.0 | 2600000.0 | ... | | | | | | | | | | |
| 3 | Fantasia | Walt Disney Productions | RKO Radio Pictures | [November 13, 1940] | 126 minutes | United States | English | $76.4–$83.3 million (United States and Canada) | 126.0 | 2280000.0 | ... | | | | | | | | | | |
| 4 | The Reluctant Dragon | Walt Disney Productions | RKO Radio Pictures | [June 27, 1941] | 74 minutes | United States | English | $960,000 (worldwide rentals) | 74.0 | 600000.0 | ... | | | | | | | | | | |


In [65]:
disney_movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 519 entries, 0 to 518
Data columns (total 50 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   title                    519 non-null    object        
 1   Production company       214 non-null    object        
 2   Distributed by           517 non-null    object        
 3   Release date             339 non-null    object        
 4   Running time             496 non-null    object        
 5   Country                  464 non-null    object        
 6   Language                 498 non-null    object        
 7   Box office               401 non-null    object        
 8   Running time (int)       496 non-null    float64       
 9   Budget (float)           307 non-null    float64       
 10  Box office (float)       389 non-null    float64       
 11  Release date (datetime)  332 non-null    datetime64[ns]
 12  imdb                     501 non-nul

In [62]:
disney_movie_data.to_csv("disney_movie_data_final.csv")
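
Two things are worth knowing when the data goes through CSV: `to_csv` writes the row index as an unnamed first column unless `index=False` is passed, and list-valued cells such as `Starring` come back as plain strings. The sketch below shows both on a one-row frame with made-up values, using an in-memory buffer instead of a file; `round_trip` is a hypothetical helper.

```python
import io
import pandas as pd

# Made-up single record, for illustration only
df = pd.DataFrame([{"title": "Example Movie", "Starring": ["Actor A", "Actor B"]}])

def round_trip(frame, **to_csv_kwargs):
    """Write the frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(df)
without_index = round_trip(df, index=False)

print(list(with_index.columns))
print(list(without_index.columns))
# The list has become its string representation
print(repr(without_index.loc[0, "Starring"]))
```

If the list structure matters downstream, the JSON export keeps it intact, while the CSV is better suited to the scalar columns.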

In [63]:
running_times = disney_movie_data.sort_values(['Running time (int)'],  ascending=False)
running_times.head(10)

| | title | Production company | Distributed by | Release date | Running time | Country | Language | Box office | Running time (int) | Budget (float) | ... | Original concept by | Created by | Original work | Owner | Music | Lyrics | Book | Basis | Productions | Awards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 517 | Tinker Bell | DisneyToon Studios | [Walt Disney Studios, Home Entertainment] | | [468 minutes] | United States | English | | 468.0 | | ... | | | | | | | | | | |
| 26 | Davy Crockett: King of the Wild Frontier | Walt Disney Productions | Buena Vista Film Distribution Co., Inc. | May 25, 1955 | 192 minutes | United States | English | $50 million (US) | 192.0 | | ... | | | | | | | | | | |
| 328 | Pirates of the Caribbean: At World's End | | Buena Vista Pictures | | 167 minutes | United States | English | $960.9 million | 167.0 | 300000000.0 | ... | | | | | | | | | | |
| 86 | The Happiest Millionaire | Walt Disney Productions | Buena Vista Distribution | | [164 minutes, (, Los Angeles, premiere), 144 m... | United States | English | $5 million (U.S./Canada rentals) | 164.0 | 5000000.0 | ... | | | | | | | | | | |
| 441 | Jagga Jasoos | | UTV Motion Pictures | [14 July 2017] | 162 minutes | India | Hindi | 83 crore | 162.0 | | ... | | | | | | | | | | |
| 434 | Dangal | | UTV Motion Pictures | | 161 minutes | India | Hindi | est. est. | 161.0 | | ... | | | | | | | | | | |
| 466 | Hamilton | | Walt Disney Studios Motion Pictures | [July 3, 2020] | 160 minutes | United States | English | | 160.0 | 12500000.0 | ... | | | | | | | | | | |
| 422 | ABCD 2 | Walt Disney Pictures | UTV Motion Pictures | [19 June 2015] | 154 minutes | India | Hindi | est. | 154.0 | | ... | | | | | | | | | | |
| 319 | Pirates of the Caribbean: Dead Man's Chest | | Buena Vista Pictures | | 150 minutes | United States | English | $1.066 billion | 150.0 | 225000000.0 | ... | | | | | | | | | | |
| 338 | The Chronicles of Narnia: Prince Caspian | | Walt Disney Studios Motion Pictures | | 150 minutes | | English | $419.7 million | 150.0 | 225000000.0 | ... | | | | | | | | | | |


[Reference 1](https://www.youtube.com/watch?v=Ewgy-G9cmbg&list=WL&index=64)<br>
[Reference 2](https://medium.com/analytics-vidhya/portfolio-project-1-disney-movie-anaysis-12190297d1fe)<br>