# |   Movie Analysis   |
 ***


Disney dataset creation with Python and BeautifulSoup.
In this notebook, data is scraped from a list of Disney film Wikipedia pages and cleaned to produce a dataset for further analysis.

In this repo I scrape Wikipedia pages to create a dataset of Walt Disney Pictures films. The repo covers a range of Python and data science topics:
- Web scraping with BeautifulSoup
- Cleaning data
- Testing code with Pytest
- Pattern matching with regular expressions (re library)
- Working with dates (datetime library)
- Saving & loading data with the pickle library
- Accessing data from an API using the Requests library
***

## Import Libraries
The libraries used in this notebook are loaded.

In [1]:
from bs4 import BeautifulSoup as bs

import requests
import json
import re
import pickle
import urllib.parse
import os
import pandas as pd

## Obtaining relevant film data from a webpage
To create the dataset we must copy the relevant information from the web pages and then clean the data. In this section I obtain the data from a single page to test and refine the code; afterwards we proceed with the whole filmography.

***
### TASK I: ACQUIRE DATA FOR ONE FILM.

### Loading raw page data.
The Wikipedia page is loaded with all of its raw HTML.

In [2]:
filmPage = requests.get("https://en.wikipedia.org/wiki/High_School_Musical_3:_Senior_Year")
pageContent = bs(filmPage.content)    #### convert to a bs4 object
contents = pageContent.prettify()

print(contents)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   High School Musical 3: Senior Year - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"X-4OFQpAICwAAIPJoeEAAAAF","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"High_School_Musical_3:_Senior_Year","wgTitle":"High School Musical 3: Senior Year","wgCurRevisionId":999945808,"wgRevisionId":999945808,"wgArticleId":9391085,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","Wikipedia indefinitely move-protected pages","Template film date

##### Code Explanation:

The raw page is fetched with 'requests.get()' and stored in "filmPage"; its content is then converted to a bs4 object and saved in "pageContent". Finally, for legibility, we apply '.prettify()' before printing the page.

### Obtaining relevant film data.
After exploring the raw data and the original webpage, we identify the section of the raw code in which the applicable information is located. 

In [3]:
infoBox = pageContent.find(class_="infobox vevent") #information box
#print(infoBox.prettify())

infoRows = infoBox.find_all("tr") #raw row data
for row in infoRows:
    print(row.prettify())


<tr>
 <th class="summary" colspan="2" style="text-align:center;font-size:125%;font-weight:bold;font-size:110%;font-style:italic;">
  High School Musical 3:
  <br/>
  Senior Year
 </th>
</tr>

<tr>
 <td colspan="2" style="text-align:center">
  <a class="image" href="/wiki/File:HSM_3_Poster.JPG" title="The six main cast members do their signature jump, this time in prom outfits and graduation gowns">
   <img alt="The six main cast members do their signature jump, this time in prom outfits and graduation gowns" class="thumbborder" data-file-height="370" data-file-width="250" decoding="async" height="326" src="//upload.wikimedia.org/wikipedia/en/thumb/a/af/HSM_3_Poster.JPG/220px-HSM_3_Poster.JPG" srcset="//upload.wikimedia.org/wikipedia/en/a/af/HSM_3_Poster.JPG 1.5x" width="220"/>
  </a>
  <div style="font-size:95%;padding:0.35em 0.35em 0.25em;line-height:1.25em;">
   Theatrical release poster
  </div>
 </td>
</tr>

<tr>
 <th scope="row" style="white-space:nowrap;padding-right:0.65em;">
  

##### Code Explanation:
Using the raw HTML, we locate the data in the "information box" section of the film's web page, specifically the element with class = 'infobox vevent'. This element is saved in 'infoBox' using pageContent.find(). We then find that the data points are stored in the "< tr >" (table row) elements, so infoBox.find_all("tr") collects all rows of film data into infoRows, which are then printed.

### Extracting the information into a dictionary.

In [4]:
movieInfo = {}
def row_Content_List(rowDataRaw):
    if rowDataRaw.find("li"):
        return [li.get_text(" ", strip = True).replace("\xa0", " ") for li in rowDataRaw.find_all("li")]
    else:
        return rowDataRaw.get_text(" ", strip = True).replace("\xa0", " ")

for i, row in enumerate(infoRows):
    if i == 0:
        movieInfo['Title'] = row.find("th").get_text(" ", strip = True)
    elif i == 1:
        continue
    else:
        rowKey = row.find("th").get_text(" ", strip = True)
        rowValue = row_Content_List(row.find("td"))
        movieInfo[rowKey] = rowValue
    
movieInfo

{'Title': 'High School Musical 3: Senior Year',
 'Directed by': 'Kenny Ortega',
 'Produced by': ['Bill Borden', 'Barry Rosenbush'],
 'Written by': 'Peter Barsocchini',
 'Starring': ['Zac Efron',
  'Vanessa Hudgens',
  'Ashley Tisdale',
  'Lucas Grabeel',
  'Corbin Bleu',
  'Monique Coleman'],
 'Music by': 'David Lawrence',
 'Cinematography': 'Daniel Aranyò',
 'Edited by': 'Don Brochu',
 'Production company': ['Walt Disney Pictures',
  'Borden & Rosenbush Entertainment'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release date': ['October 17, 2008 ( 2008-10-17 ) (London)',
  'October 24, 2008 ( 2008-10-24 ) (United States)'],
 'Running time': '111 minutes (theatrical) [1] 120 minutes (extended/Disney+)',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$11 million [2]',
 'Box office': '$252.9 million [2]'}

##### Code Explanation:
A new dictionary, "movieInfo", is created. Next, the function "row_Content_List()" is defined; its argument, "rowDataRaw", is the uncleaned row data. If the row contains < li > elements, the function returns a list with the cleaned text of each one; otherwise it returns the cleaned text of the whole cell. The cleaning is needed because some spaces ' ' are encoded as non-breaking spaces, '\xa0'.

In the second part of the code the function is applied inside a for loop with conditions. In the first iteration (i == 0) the film title is read with 'row.find("th").get_text(" ", strip = True)' and saved in the dictionary under the key 'Title'. The second row (i == 1) holds the poster image and is skipped. For every remaining row the same steps are executed:
1. the row title (row key text) is found using row.find("th").get_text(" ", strip = True) and saved in rowKey.
2. the value associated with the rowKey is found with row_Content_List(row.find("td")) and saved in rowValue.
3. the pair is saved in the dictionary in the following format: {'rowKey': 'rowValue',}
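The steps above can be tried without a network connection. The following is a minimal sketch, assuming a hand-written HTML fragment (`SAMPLE_HTML`, hypothetical) in place of the real page; the helper names `cell_value` and `parse_infobox` are illustrative and not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written infobox fragment standing in for a real page.
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th colspan="2">Example Film</th></tr>
  <tr><td colspan="2">Poster image</td></tr>
  <tr><th>Directed by</th><td>Jane&nbsp;Doe</td></tr>
  <tr><th>Starring</th><td><ul><li>Actor One</li><li>Actor Two</li></ul></td></tr>
</table>
"""

def cell_value(cell):
    """Return a list for cells with <li> items, otherwise one cleaned string."""
    if cell.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ")
                for li in cell.find_all("li")]
    return cell.get_text(" ", strip=True).replace("\xa0", " ")

def parse_infobox(html):
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find(class_="infobox vevent").find_all("tr")
    info = {}
    for i, row in enumerate(rows):
        header, cell = row.find("th"), row.find("td")
        if i == 0:
            info["Title"] = header.get_text(" ", strip=True)
        elif header and cell:  # the poster row has no <th>, so it is skipped
            info[header.get_text(" ", strip=True)] = cell_value(cell)
    return info

sample_info = parse_infobox(SAMPLE_HTML)
print(sample_info)
```

Checking for a header instead of relying on the row index is one possible design choice: it keeps working when a page has no poster row or has extra rows without a "th" cell.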


 ***
### TASK II: ACQUIRE DATA FOR ALL FILMS.
Having successfully captured the relevant data from one film, we proceed to secure the data for all the movies in the filmography.
### Loading raw page data.

In [5]:
filmography = requests.get ('https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films')
allContent = bs(filmography.content) # Convert to a beautiful soup object

contents = allContent.prettify() # Print out the HTML
print(contents)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Walt Disney Pictures films - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"X-tPqApAICoAAC3peioAAACI","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Walt_Disney_Pictures_films","wgTitle":"List of Walt Disney Pictures films","wgCurRevisionId":998678114,"wgRevisionId":998678114,"wgArticleId":1970335,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","CS1 maint: archived copy as title","Articles with short descript

We select every italicized entry in the sortable tables; each one holds a film title and the link to its page.

In [6]:
allFilms = allContent.select(".wikitable.sortable i")
allFilms[0:1]

[<i><a href="/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons" title="Academy Award Review of Walt Disney Cartoons">Academy Award Review of Walt Disney Cartoons</a></i>]

### Definition of functions to obtain data for all films.

In [7]:
def row_Content_All(rawRowData):
    if rawRowData.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in rawRowData.find_all("li")]
    elif rawRowData.find("br"):
        return [text for text in rawRowData.stripped_strings]
    else:
        return rawRowData.get_text(" ", strip=True).replace("\xa0", " ")

def clean_tags(allContent):
    for tag in allContent.find_all(["sup", "span"]):
        tag.decompose()
        
def allInfoBox(url):
    filmPage = requests.get(url)
    pageContent = bs(filmPage.content)
    infoBox = pageContent.find(class_="infobox vevent")
    infoRows = infoBox.find_all("tr") 
    clean_tags(pageContent)
    
    filmInfo = {}
    
    for i, row in enumerate(infoRows):
        if i == 0:
            filmInfo['title'] = row.find("th").get_text(" ", strip=True)
        else:
            header = row.find('th')
            if header:
                rowKey = row.find("th").get_text(" ", strip=True)
                rowValue = row_Content_All(row.find("td"))
                filmInfo[rowKey] = rowValue
            
    return filmInfo    

##### Code Explanation:
We define the functions needed to collect the data for all films, building on the code from the previous section.
1. The function "row_Content_All" returns the clean text from rawRowData: a list when the cell contains "li" or "br" elements, otherwise a single string.
2. The "clean_tags" function finds and removes (decomposes) all the "sup" and "span" elements, such as footnote markers like [1].
3. The function allInfoBox requests a Wikipedia page, locates the info box and its rows, and calls clean_tags to remove the unwanted elements. The dictionary "filmInfo" is created to store the information. Then, for every row that has a "th" header, the text is stripped, separated, and stored in the following format: {'title': "movie", 'rowKey': 'rowValue'}


##### Test:

In [8]:
allInfoBox('https://en.wikipedia.org/wiki/Fantasia_(1940_film)')

{'title': 'Fantasia',
 'Directed by': ['Samuel Armstrong',
  'James Algar',
  'Bill Roberts',
  'Paul Satterfield',
  'Ben Sharpsteen',
  'David D. Hand',
  'Hamilton Luske',
  'Jim Handley',
  'Ford Beebe',
  'T. Hee',
  'Norman Ferguson',
  'Wilfred Jackson'],
 'Produced by': ['Walt Disney', 'Ben Sharpsteen'],
 'Story by': ['Joe Grant', 'Dick Huemer'],
 'Starring': ['Leopold Stokowski', 'Deems Taylor'],
 'Narrated by': 'Deems Taylor',
 'Music by': 'See program',
 'Cinematography': 'James Wong Howe',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'RKO Radio Pictures',
 'Release date': ['November 13, 1940'],
 'Running time': '126 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$2.28 million',
 'Box office': '$76.4–$83.3 million'}

### Extracting the information into a list of dictionaries.
Once all functions are defined, we execute them for all films.

In [9]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")
soup = bs(r.content)
films = soup.select(".wikitable.sortable i a")

wikiPath = "https://en.wikipedia.org"  # no trailing slash: each href already starts with "/wiki/"
filmInfoList = []

for i, film in enumerate(films):
    if i % 10 == 0:
        print(i)
    try:
        moviePath = film['href']
        completeUrl = wikiPath + moviePath
        title = film['title']
        
        filmInfoList.append(allInfoBox(completeUrl))
        
    except Exception as e:
        print(film.get_text())
        print(e)

0
10
20
30
40
Zorro the Avenger
'NoneType' object has no attribute 'find'
The Sign of Zorro
'NoneType' object has no attribute 'find'
50
60
70
80
90
100
110
120
True-Life Adventures
'NoneType' object has no attribute 'find_all'
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
Luca
'NoneType' object has no attribute 'find_all'
440


##### Code Explanation: 
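We collect the URLs of all films, run allInfoBox(url) on each one, and save the results in "filmInfoList". Progress is printed every ten films; pages that raise an exception (for example, pages without an info box) are reported and skipped.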
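The links scraped from the list page are site-relative (they start with "/wiki/"), so plain string concatenation depends on whether the base URL ends in a slash. As a small sketch, the standard library's `urljoin` resolves them the same way in both cases; `WIKI_BASE` and `absolute_url` are illustrative names, not part of the notebook.

```python
from urllib.parse import urljoin

WIKI_BASE = "https://en.wikipedia.org/"

def absolute_url(href, base=WIKI_BASE):
    """Resolve a site-relative href such as '/wiki/Fantasia_(1940_film)'."""
    return urljoin(base, href)

# Naive concatenation onto a base with a trailing slash doubles the slash.
naive = WIKI_BASE + "/wiki/Fantasia_(1940_film)"
joined = absolute_url("/wiki/Fantasia_(1940_film)")

print(naive)   # https://en.wikipedia.org//wiki/Fantasia_(1940_film)
print(joined)  # https://en.wikipedia.org/wiki/Fantasia_(1940_film)
```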

We review the first entry of our data.

In [10]:
filmInfoList[0]

{'title': 'Academy Award Review of',
 'Production company': 'Walt Disney Productions',
 'Release date': ['May 19, 1937'],
 'Running time': '41 minutes (74 minutes 1966 release)',
 'Country': 'United States',
 'Language': 'English',
 'Box office': '$45.472'}

***
## TASK III: Save/Reload Movie Data
With the raw data collected, we save the dataset to disk as JSON so it can be reloaded later without scraping again.

In [11]:
def saveData(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

In [12]:
def loadData(title):
    with open(title, encoding="utf-8") as f:
        return json.load(f)

In [13]:
saveData("disney_data.json", filmInfoList)
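As a quick check of the save and load pattern, the sketch below writes a small hypothetical record to a temporary file and reads it back. It assumes nothing about the real dataset; `sample` is made-up data.

```python
import json
import os
import tempfile

# Hypothetical record containing a non-ASCII character.
sample = [{"title": "Example Film", "Cinematography": "Daniel Aranyò"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    # Same options as saveData: UTF-8 file, readable non-ASCII, indented.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

restored = json.loads(raw_text)
print("Aranyò" in raw_text)  # True: the name is stored as-is, not as an escape
print(restored == sample)    # True: the round trip preserves the data
```

With `ensure_ascii=False` the file stays readable for names with accents, which is why the file is opened with an explicit UTF-8 encoding.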

***
## TASK IV: Cleaning Data
We now have a workable dataset. Nevertheless, it is very hard to do meaningful analysis with the dataset in its current state.


In [14]:
filmInfoList = loadData("disney_data.json")

### Exploring Data

In [15]:
filmInfoList[-40]

{'title': 'Incredibles 2',
 'Directed by': 'Brad Bird',
 'Produced by': ['John Walker', 'Nicole Paradis Grindle'],
 'Written by': 'Brad Bird',
 'Starring': ['Craig T. Nelson',
  'Holly Hunter',
  'Sarah Vowell',
  'Huckleberry Milner',
  'Samuel L. Jackson'],
 'Music by': 'Michael Giacchino',
 'Cinematography': ['Mahyar Abousaeedi', 'Erik Smitt'],
 'Edited by': 'Stephen Schaffer',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release date': ['June 5, 2018 ( Los Angeles )',
  'June 15, 2018 (United States)'],
 'Running time': '118 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$200 million',
 'Box office': '$1.243 billion'}

### Subtasks
By exploring the data we can clearly see 5 problems to solve.
- Clean up references
- Convert running time into an integer
- Convert dates into datetime object
- Split up the long strings
- Convert Budget & Box office to numbers

### Convert running time into an integer

In [16]:
print([movie.get('Running time', 'N/A') for movie in filmInfoList]) ##View Running Time Data

['41 minutes (74 minutes 1966 release)', '83 minutes', '88 minutes', '126 minutes', '74 minutes', '64 minutes', '70 minutes', '42 minutes', '65 min.', '71 minutes', '75 minutes', '94 minutes', '73 minutes', '75 minutes', '82 minutes', '68 minutes', '74 minutes', '96 minutes', '75 minutes', '84 minutes', '77 minutes', '92 minutes', '69 minutes', '81 minutes', ['60 minutes (VHS version)', '71 minutes (original)'], '127 minutes', '92 minutes', '76 minutes', '75 minutes', '73 minutes', '85 minutes', '81 minutes', '70 minutes', '90 min.', '80 minutes', '75 minutes', '83 minutes', '83 minutes', '72 minutes', '97 minutes', '75 minutes', '104 minutes', '93 minutes', '105 minutes', '95 minutes', '97 minutes', '134 minutes', '69 minutes', '92 minutes', '126 minutes', '79 minutes', '97 minutes', '128 minutes', '74 minutes', '91 minutes', '105 minutes', '98 minutes', '130 minutes', '89 min.', '93 minutes', '67 minutes', '98 minutes', '100 minutes', '118 minutes', '103 Minutes', '110 minutes', '80 

In [17]:
# Separation and conversion
def minutes_to_integer(running_time):
    if running_time == "N/A":
        return None
    
    if isinstance(running_time, list):
        return int(running_time[0].split(" ")[0])
    else: # is a string
        return int(running_time.split(" ")[0])

for movie in filmInfoList:
    movie['Running time (int)'] = minutes_to_integer(movie.get('Running time', "N/A"))

##### Code Explanation: 
We define a function that takes running_time as its argument. It returns None for missing values ("N/A"); if the value is a list, it uses the first entry. It then splits the text on spaces and converts the first token to an integer, which drops the trailing "minutes" or "min." label.

We execute this function on the "Running time" entry of every movie in the list filmInfoList.
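Splitting on the first space works as long as every entry starts with the number. As a hedged alternative, the sketch below uses a regular expression to pick the first integer wherever it appears; `parse_minutes` is an illustrative name, and the last example string is hypothetical rather than taken from the dataset.

```python
import re

def parse_minutes(running_time):
    """Return the first integer found in a running-time entry, or None."""
    if running_time is None or running_time == "N/A":
        return None
    if isinstance(running_time, list):
        # Join list entries so the first number of the first entry wins.
        running_time = " ".join(running_time)
    match = re.search(r"\d+", running_time)
    return int(match.group()) if match else None

examples = [
    "83 minutes",
    "65 min.",
    ["60 minutes (VHS version)", "71 minutes (original)"],
    "N/A",
    "approx. 90 minutes",  # hypothetical entry that does not start with a number
]
print([parse_minutes(e) for e in examples])  # [83, 65, 60, None, 90]
```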

In [18]:
print([movie.get('Running time (int)', 'N/A') for movie in filmInfoList])
# Print the running times after the conversion.

[41, 83, 88, 126, 74, 64, 70, 42, 65, 71, 75, 94, 73, 75, 82, 68, 74, 96, 75, 84, 77, 92, 69, 81, 60, 127, 92, 76, 75, 73, 85, 81, 70, 90, 80, 75, 83, 83, 72, 97, 75, 104, 93, 105, 95, 97, 134, 69, 92, 126, 79, 97, 128, 74, 91, 105, 98, 130, 89, 93, 67, 98, 100, 118, 103, 110, 80, 79, 91, 91, 97, 118, 139, 92, 131, 87, 116, 93, 110, 110, 131, 101, 108, 84, 78, 75, 164, 106, 110, 99, 113, 108, 112, 93, 91, 93, 100, 100, 79, 96, 113, 89, 118, 92, 88, 92, 87, 93, 93, 93, 90, 83, 96, 88, 89, 91, 93, 92, 97, 100, 100, 89, 91, 112, 115, 95, 91, 95, 104, 74, 48, 77, 104, 128, 101, 94, 104, 90, 100, 88, 93, 98, 100, 112, 84, 98, 97, 114, 96, 100, 109, 83, 90, 107, 96, 103, 91, 95, 105, 113, 80, 101, 89, 74, 90, 89, 110, 74, 93, 84, 83, 69, 77, 107, 93, 88, 108, 84, 121, 89, 104, 90, 86, 84, 108, 107, 96, 98, 105, 108, 94, 106, 102, 88, 102, 102, 97, 111, 100, 96, 98, 78, 81, 108, 89, 99, 89, 81, 92, 100, 89, 79, 91, 101, 104, 103, 86, 105, 93, 92, 98, 95, 93, 87, 93, 87, 128, 86, 95, 114, 93, 

###### Conversion successful.

### Convert Budget & Box office to numbers
We print the nested budget list to explore the data.

In [19]:
print([movie.get('Budget', 'N/A') for movie in filmInfoList])

['N/A', '$1.49 million', '$2.6 million', '$2.28 million', '$600,000', '$950,000', '$858,000', 'N/A', '$788,000', 'N/A', '$1.35 million', '$2.125 million', 'N/A', '$1.5 million', '$1.5 million', 'N/A', '$2.9 million', '$1,800,000', '$3 million', 'N/A', '$4 million', '$2 million', '$300,000', '$1.8 million', 'N/A', '$5 million', 'N/A', '$4 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$700,000', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$6 million', 'under $1 million or $1,250,000', 'N/A', '$2 million', 'N/A', 'N/A', '$2.5 million', 'N/A', 'N/A', '$4 million', '$3.6 million', 'N/A', 'N/A', 'N/A', 'N/A', '$3 million', 'N/A', '$3 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$3 million', 'N/A', 'N/A', 'N/A', 'N/A', '$4.4–6 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$4 million', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$4 million', 'N/A', 'N/A', 'N/A', '

Now we proceed to make the budget values easier to work with: we convert all the elements into numeric values, which is better for future analyses.

In [20]:
#import re
amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"

word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
value_re = rf"\${number}"

def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

def parse_word_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value

def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    return value

'''
money_conversion("$12.2 million") --> 12200000 ## Word syntax
money_conversion("$790,000") --> 790000        ## Value syntax
'''
def money_conversion(money):
    if money == "N/A":
        return None

    if isinstance(money, list):
        money = money[0]
        
    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)

    if word_syntax:
        return parse_word_syntax(word_syntax.group())

    elif value_syntax:
        return parse_value_syntax(value_syntax.group())

    else:
        return None

Testing conversion

In [21]:
money_conversion(str(filmInfoList[-40]["Budget"]))

200000000.0
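To see how the conversion behaves on the different formats found in the budget list, the following is a compact, self-contained variant. It is a sketch and not the notebook's exact code: the names `NUMBER`, `SCALE`, `MONEY_RE` and `money_to_float` are assumptions, and for ranges it keeps the lower bound, as the functions above do.

```python
import re

NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"
SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

# "$" + number, an optional range end ("-", "–" or "to"), an optional scale word.
MONEY_RE = re.compile(
    rf"\$(?P<value>{NUMBER})"
    rf"(?:\s*(?:-|–|to)\s*\$?{NUMBER})?"
    rf"(?:\s*(?P<word>thousand|million|billion))?",
    flags=re.I,
)

def money_to_float(text):
    """Convert strings such as '$200 million' or '$790,000' to a float."""
    match = MONEY_RE.search(text)
    if match is None:
        return None  # no dollar amount found, e.g. '70 crore'
    value = float(match.group("value").replace(",", ""))
    word = match.group("word")
    return value * SCALE[word.lower()] if word else value

for sample in ["$200 million", "$790,000", "$1,800,000",
               "$4.4–6 million", "$1.243 billion", "70 crore"]:
    print(sample, "->", money_to_float(sample))
```

Entries without a dollar sign return None instead of raising, so non-dollar currencies can be handled separately later.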

### Convert dates into datetime object

In [22]:
# Convert Dates into datetimes
print([movie.get('Release date', 'N/A') for movie in filmInfoList])

[['May 19, 1937'], ['December 21, 1937 ( Carthay Circle Theatre , Los Angeles , CA , premiere)'], ['February 7, 1940 ( Center Theatre )', 'February 23, 1940 (United States)'], ['November 13, 1940'], ['June 20, 1941'], ['October 23, 1941 (New York City)', 'October 31, 1941 (U.S.)'], ['August 9, 1942 (World Premiere-London)', 'August 13, 1942 (Premiere-New York City)', 'August 21, 1942 (U.S.)'], ['August 24, 1942 (World Premiere-Rio de Janeiro)', 'February 6, 1943 (U.S. Premiere-Boston)', 'February 19, 1943 (U.S.)'], ['July 17, 1943'], ['December 21, 1944 (Mexico City)', 'February 3, 1945 (US)'], ['April 20, 1946 (New York City premiere)', 'August 15, 1946 (U.S.)'], ['November 12, 1946 (Premiere: Atlanta, Georgia)', 'November 20, 1946'], ['September 27, 1947'], 'May 27, 1948', ['November 29, 1948 (Chicago, Illinois)', 'January 19, 1949 (Indianapolis, Indiana)'], ['October 5, 1949'], ['February 15, 1950 (Boston)', 'March 4, 1950 (United States)'], ['June 22, 1950 (World Premiere- London )

In [23]:
filmInfoList[-50]

{'title': 'Dangal',
 'Directed by': 'Nitesh Tiwari',
 'Produced by': ['Aamir Khan', 'Kiran Rao', 'Siddharth Roy Kapur'],
 'Written by': ['Nitesh Tiwari',
  'Piyush Gupta',
  'Shreyas Jain',
  'Nikhil Meharotra'],
 'Starring': ['Aamir Khan',
  'Sakshi Tanwar',
  'Fatima Sana Shaikh',
  'Zaira Wasim',
  'Sanya Malhotra',
  'Suhani Bhatnagar',
  'Aparshakti Khurana',
  'Girish Kulkarni'],
 'Narrated by': 'Aparshakti Khurana',
 'Music by': 'Pritam',
 'Cinematography': 'Setu (Satyajit Pande)',
 'Edited by': 'Ballu Saluja',
 'Production company': ['Aamir Khan Productions',
  'Walt Disney Pictures India'],
 'Distributed by': 'UTV Motion Pictures',
 'Release date': ['21 December 2016 (United States)',
  '23 December 2016 (India)'],
 'Running time': '161 minutes',
 'Country': 'India',
 'Language': 'Hindi',
 'Budget': '70 crore',
 'Box office': ['est.', '(', ')'],
 'Running time (int)': 161}

In [24]:
# Example of the target format: June 28, 1950
from datetime import datetime

dates = [movie.get('Release date', 'N/A') for movie in filmInfoList]

def clean_date(date):
    return date.split("(")[0].strip()

def date_conversion(date):
    if isinstance(date, list):
        date = date[0]
        
    if date == "N/A":
        return None
        
    date_str = clean_date(date)

    fmts = ["%B %d, %Y", "%d %B %Y"]
    for fmt in fmts:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:  # this format did not match; try the next one
            pass
    return None
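The sketch below runs the same idea on three sample strings, two of them taken from the release dates printed earlier, to show which formats are accepted. `parse_release_date` and `DATE_FORMATS` are illustrative names, and the month names are assumed to be in English (the default locale).

```python
from datetime import datetime

DATE_FORMATS = ("%B %d, %Y", "%d %B %Y")  # "January 25, 2016" and "21 December 2016"

def parse_release_date(text):
    """Parse the part of a release-date string that precedes any '(' note."""
    cleaned = text.split("(")[0].strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next format
    return None

print(parse_release_date("January 25, 2016 ( TCL Chinese Theatre )"))  # 2016-01-25 00:00:00
print(parse_release_date("21 December 2016 (United States)"))          # 2016-12-21 00:00:00
print(parse_release_date("Summer 2016"))                               # None
```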

In [25]:
for movie in filmInfoList:
    movie['Release date (datetime)'] = date_conversion(movie.get('Release date', 'N/A'))

In [26]:
filmInfoList[50]

{'title': 'One Hundred and One Dalmatians',
 'Directed by': ['Clyde Geronimi', 'Hamilton Luske', 'Wolfgang Reitherman'],
 'Produced by': 'Walt Disney',
 'Story by': 'Bill Peet',
 'Based on': ['The Hundred and One Dalmatians', 'by', 'Dodie Smith'],
 'Starring': ['Rod Taylor',
  'Cate Bauer',
  'Betty Lou Gerson',
  'Ben Wright',
  'Bill Lee (singing voice)',
  'Lisa Davis',
  'Martha Wentworth'],
 'Music by': 'George Bruns',
 'Edited by': ['Roy M. Brewer, Jr.', 'Donald Halliday'],
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Distribution',
 'Release date': ['January 25, 1961'],
 'Running time': '79 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$3.6 million',
 'Box office': '$303 million',
 'Running time (int)': 79,
 'Release date (datetime)': datetime.datetime(1961, 1, 25, 0, 0)}

### Saving Cleaned Dataset
We have now cleaned parts of our dataset. In this section we save and back up our work.

In [27]:
#import pickle
#Defining Functions
def save_data_pickle(name, data):
    with open(name, 'wb') as f:
        pickle.dump(data, f)
        
def load_data_pickle(name):
    with open(name, 'rb') as f:
        return pickle.load(f)

In [28]:
save_data_pickle("disney_movie_data_cleaned_more.pickle", filmInfoList)
a = load_data_pickle("disney_movie_data_cleaned_more.pickle")
a == filmInfoList

True
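One reason to use pickle at this stage is that the records now contain datetime objects, which the json module cannot serialize without extra handling. A minimal sketch with a hypothetical record:

```python
import json
import pickle
from datetime import datetime

# Hypothetical record with a datetime value, like the cleaned film entries.
record = {"title": "Example Film",
          "Release date (datetime)": datetime(1961, 1, 25)}

try:
    json.dumps(record)
    json_ok = True
except TypeError:  # datetime is not JSON serializable by default
    json_ok = False

# pickle round-trips the datetime object unchanged.
restored = pickle.loads(pickle.dumps(record))

print(json_ok)             # False
print(restored == record)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.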

### Task V: Attach IMDB/Rotten Tomatoes/Metascore scores

In [29]:
filmInfoList = load_data_pickle('disney_movie_data_cleaned_more.pickle')

In [30]:
filmInfoList[-60]

{'title': 'The Finest Hours',
 'Directed by': 'Craig Gillespie',
 'Produced by': ['Jim Whitaker', 'Dorothy Aufiero'],
 'Screenplay by': ['Scott Silver', 'Paul Tamasy', 'Eric Johnson'],
 'Based on': ["The Finest Hours: The True Story of the U.S. Coast Guard's Most Daring Sea Rescue",
  'by',
  'Michael J. Tougias',
  'and',
  'Casey Sherman'],
 'Starring': ['Chris Pine',
  'Casey Affleck',
  'Ben Foster',
  'Holliday Grainger',
  'John Ortiz',
  'Eric Bana'],
 'Music by': 'Carter Burwell',
 'Cinematography': 'Javier Aguirresarobe',
 'Edited by': 'Tatiana S. Riegel',
 'Production company': ['Walt Disney Pictures',
  'Whitaker Entertainment',
  'Red Hawk Entertainment'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release date': ['January 25, 2016 ( TCL Chinese Theatre )',
  'January 29, 2016 (United States)'],
 'Running time': '117 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$70–80 million',
 'Box office': '$52.1 million',
 'Running tim

#### We explore the Open Movie Database (OMDb) API: http://www.omdbapi.com/?apikey=[yourkey]&

In [45]:
#import requests
#import urllib
#import os
def get_omdb_info(title):
    base_url = "http://www.omdbapi.com/?"
    # Read the API key from the environment instead of hard-coding it in the notebook.
    parameters = {"apikey": os.environ['OMDB_API_KEY'], 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()

def get_rotten_tomato_score(omdb_info):
    ratings = omdb_info.get('Ratings', [])
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None

omdb_info = get_omdb_info("into the woods")
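The two pieces of this cell can be checked without calling the API. The sketch below only builds the request URL and extracts a rating from a hand-written response; `build_omdb_url`, `rating_from`, the placeholder key "YOUR_KEY" and `sample_response` are all illustrative assumptions.

```python
from urllib.parse import urlencode

def build_omdb_url(title, api_key):
    """Build an OMDb title-lookup URL; the key is passed in, not hard-coded."""
    return "http://www.omdbapi.com/?" + urlencode({"apikey": api_key, "t": title})

def rating_from(omdb_info, source):
    """Return the rating value for the given source, or None if it is absent."""
    for rating in omdb_info.get("Ratings", []):
        if rating["Source"] == source:
            return rating["Value"]
    return None

# Hand-written stand-in for a parsed API response.
sample_response = {
    "Ratings": [
        {"Source": "Internet Movie Database", "Value": "5.9/10"},
        {"Source": "Rotten Tomatoes", "Value": "71%"},
    ]
}

url = build_omdb_url("into the woods", "YOUR_KEY")
print(url)  # http://www.omdbapi.com/?apikey=YOUR_KEY&t=into+the+woods
print(rating_from(sample_response, "Rotten Tomatoes"))  # 71%
print(rating_from({}, "Rotten Tomatoes"))               # None
```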

In [46]:
omdb_info

{'Title': 'Into the Woods',
 'Year': '2014',
 'Rated': 'PG',
 'Released': '25 Dec 2014',
 'Runtime': '125 min',
 'Genre': 'Adventure, Comedy, Drama, Fantasy, Musical',
 'Director': 'Rob Marshall',
 'Writer': 'James Lapine (screenplay by), James Lapine (based on the musical by)',
 'Actors': 'Anna Kendrick, Daniel Huttlestone, James Corden, Emily Blunt',
 'Plot': 'A witch tasks a childless baker and his wife with procuring magical items from classic fairy tales to reverse the curse put on their family tree.',
 'Language': 'English',
 'Country': 'USA',
 'Awards': 'Nominated for 3 Oscars. Another 11 wins & 71 nominations.',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTY4MzQ4OTY3NF5BMl5BanBnXkFtZTgwNjM5MDI3MjE@._V1_SX300.jpg',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '5.9/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '71%'},
  {'Source': 'Metacritic', 'Value': '69/100'}],
 'Metascore': '69',
 'imdbRating': '5.9',
 'imdbVotes': '132,028',
 'imdbID': 'tt2180411',


In [42]:
for movie in filmInfoList:
    title = movie['title']
    omdb_info = get_omdb_info(title)
    movie['imdb'] = omdb_info.get('imdbRating', None)
    movie['metascore'] = omdb_info.get('Metascore', None)
    movie['rotten_tomatoes'] = get_rotten_tomato_score(omdb_info)

In [50]:
filmInfoList[50]

{'title': 'One Hundred and One Dalmatians',
 'Directed by': ['Clyde Geronimi', 'Hamilton Luske', 'Wolfgang Reitherman'],
 'Produced by': 'Walt Disney',
 'Story by': 'Bill Peet',
 'Based on': ['The Hundred and One Dalmatians', 'by', 'Dodie Smith'],
 'Starring': ['Rod Taylor',
  'Cate Bauer',
  'Betty Lou Gerson',
  'Ben Wright',
  'Bill Lee (singing voice)',
  'Lisa Davis',
  'Martha Wentworth'],
 'Music by': 'George Bruns',
 'Edited by': ['Roy M. Brewer, Jr.', 'Donald Halliday'],
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Distribution',
 'Release date': ['January 25, 1961'],
 'Running time': '79 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$3.6 million',
 'Box office': '$303 million',
 'Running time (int)': 79,
 'Release date (datetime)': datetime.datetime(1961, 1, 25, 0, 0),
 'imdb': '7.3',
 'metascore': '83',
 'rotten_tomatoes': '98%'}

for movie in filmInfoList:
    movie['imdb'] = float(movie['imdb'])
    movie['metascore'] = float(movie['metascore'])
    movie['rotten_tomatoes'] = float(movie['rotten_tomatoes'].strip('%'))
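The loop above assumes that every film has all three scores. When a score is missing (None, or a placeholder such as 'N/A'), float() raises an exception. A more forgiving helper could look like the sketch below; `score_to_float` is an illustrative name and the placeholder values are assumptions about what the API may return.

```python
def score_to_float(value):
    """Convert '7.3', '83' or '98%' to a float; return None when not numeric."""
    if value is None:
        return None
    try:
        return float(str(value).rstrip("%"))
    except ValueError:  # e.g. 'N/A' or other non-numeric placeholders
        return None

print([score_to_float(v) for v in ["7.3", "83", "98%", "N/A", None]])
# [7.3, 83.0, 98.0, None, None]
```

Applying it with `movie.get(key)` for each of the three keys would leave missing scores as None instead of stopping the loop.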

#### Saving data with pickle.

In [51]:
save_data_pickle('disney_movie_data_final.pickle', filmInfoList)

### Task VI: Save data as JSON & CSV

In [53]:
filmInfoList[50]

{'title': 'One Hundred and One Dalmatians',
 'Directed by': ['Clyde Geronimi', 'Hamilton Luske', 'Wolfgang Reitherman'],
 'Produced by': 'Walt Disney',
 'Story by': 'Bill Peet',
 'Based on': ['The Hundred and One Dalmatians', 'by', 'Dodie Smith'],
 'Starring': ['Rod Taylor',
  'Cate Bauer',
  'Betty Lou Gerson',
  'Ben Wright',
  'Bill Lee (singing voice)',
  'Lisa Davis',
  'Martha Wentworth'],
 'Music by': 'George Bruns',
 'Edited by': ['Roy M. Brewer, Jr.', 'Donald Halliday'],
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Distribution',
 'Release date': ['January 25, 1961'],
 'Running time': '79 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$3.6 million',
 'Box office': '$303 million',
 'Running time (int)': 79,
 'Release date (datetime)': datetime.datetime(1961, 1, 25, 0, 0),
 'imdb': '7.3',
 'metascore': '83',
 'rotten_tomatoes': '98%'}

In [54]:
movie_info_copy = [movie.copy() for movie in filmInfoList]

In [55]:
for movie in movie_info_copy:
    current_date = movie['Release date (datetime)']
    if current_date:
        movie['Release date (datetime)'] = current_date.strftime("%B %d, %Y")
    else:
        movie['Release date (datetime)'] = None

In [57]:
saveData("disney_data_final.json", movie_info_copy)

#### Convert data to CSV

In [59]:
#import pandas as pd

df = pd.DataFrame(filmInfoList)

In [60]:
df.head()

Unnamed: 0,title,Production company,Release date,Running time,Country,Language,Box office,Running time (int),Release date (datetime),imdb,...,Cinematography,Edited by,Screenplay by,Production companies,Japanese,Hepburn,Adaptation by,Animation by,Traditional,Simplified
0,Academy Award Review of,Walt Disney Productions,"[May 19, 1937]",41 minutes (74 minutes 1966 release),United States,English,$45.472,41.0,1937-05-19,5.9,...,,,,,,,,,,
1,Snow White and the Seven Dwarfs,Walt Disney Productions,"[December 21, 1937 ( Carthay Circle Theatre , ...",83 minutes,United States,English,$418 million,83.0,1937-12-21,7.6,...,,,,,,,,,,
2,Pinocchio,Walt Disney Productions,"[February 7, 1940 ( Center Theatre ), February...",88 minutes,United States,English,$164 million,88.0,1940-02-07,7.4,...,,,,,,,,,,
3,Fantasia,Walt Disney Productions,"[November 13, 1940]",126 minutes,United States,English,$76.4–$83.3 million,126.0,1940-11-13,7.7,...,James Wong Howe,,,,,,,,,
4,The Reluctant Dragon,Walt Disney Productions,"[June 20, 1941]",74 minutes,United States,English,"$960,000 (worldwide rentals)",74.0,1941-06-20,6.9,...,Bert Giennon,Paul Weatherwax,,,,,,,,


In [61]:
df.to_csv("disney_movie_data_final.csv")

In [62]:
running_times = df.sort_values(['Running time (int)'],  ascending=False)
running_times.head(20)

Unnamed: 0,title,Production company,Release date,Running time,Country,Language,Box office,Running time (int),Release date (datetime),imdb,...,Cinematography,Edited by,Screenplay by,Production companies,Japanese,Hepburn,Adaptation by,Animation by,Traditional,Simplified
443,Night at the Museum,"[21 Laps Entertainment, Ingenious Film Partner...",,306 minutes,United States,English,$1.31 billion,306.0,NaT,6.4,...,,,,,,,,,,
302,Pirates of the Caribbean: At World's End,,"[May 19, 2007 ( Disneyland Resort ), May 25, 2...",168 minutes,United States,English,$961 million,168.0,2007-05-19,7.1,...,Dariusz Wolski,"[Craig Wood, Stephen Rivkin]",,"[Walt Disney Pictures, Jerry Bruckheimer Films]",,,,,,
86,The Happiest Millionaire,Walt Disney Productions,"[June 23, 1967, November 30, 1967]","[164 minutes, (, Los Angeles, premiere), 144 m...",United States,English,$5 million (U.S./Canada rentals),164.0,1967-06-23,6.8,...,Edward Colman,Cotton Warburton,A. J. Carothers,,,,,,,
401,Jagga Jasoos,"[Walt Disney Pictures India, Picture Shuru Ent...",[14 July 2017],162 minutes,India,Hindi,833.5 million,162.0,2017-07-14,,...,Ravi Varman,Amitabh Shukla,"[Anurag Basu, Dialogues in Rhyme:, Amitabh Bha...",,,,,,,
394,Dangal,"[Aamir Khan Productions, Walt Disney Pictures ...","[21 December 2016 (United States), 23 December...",161 minutes,India,Hindi,"[est., (, )]",161.0,2016-12-21,8.4,...,Setu (Satyajit Pande),Ballu Saluja,,,,,,,,
425,Hamilton,"[Walt Disney Pictures, 5000 Broadway Productio...","[July 3, 2020]",160 minutes,United States,English,,160.0,2020-07-03,8.6,...,Declan Quinn,Jonah Moran,,,,,,,,
382,ABCD 2,Walt Disney Pictures,[19 June 2015],154 minutes,India,Hindi,est.,154.0,2015-06-19,,...,Vijay Kumar Arora,Manan Sagar,,,,,,,,
296,Pirates of the Caribbean: Dead Man's Chest,,"[June 24, 2006 ( Disneyland Resort ), July 7, ...",151 minutes,United States,English,$1.066 billion,151.0,2006-06-24,7.3,...,Dariusz Wolski,"[Craig Wood, Stephen Rivkin]",,"[Walt Disney Pictures, Jerry Bruckheimer Films]",,,,,,
311,The Chronicles of Narnia: Prince Caspian,"[Walt Disney Pictures, Walden Media]","[May 7, 2008 ( New York City ), May 16, 2008 (...",150 minutes,"[United Kingdom, United States]",English,$419.7 million,150.0,2008-05-07,6.5,...,Karl Walter Lindenlaub,Sim Evan-Jones,"[Andrew Adamson, Christopher Markus Stephen Mc...",,,,,,,
364,The Lone Ranger,"[Walt Disney Pictures, Jerry Bruckheimer Films...","[June 22, 2013 ( Hyperion Theatre ), July 3, 2...",149 minutes,United States,English,$260.5 million,149.0,2013-06-22,6.4,...,Bojan Bazelli,"[James Haygood, Craig Wood]","[Justin Haythe, Ted Elliott, Terry Rossio]",,,,,,,
