<h2> Disney Dataset Creation (w/ Python BeautifulSoup) </h2>

<p> Scrape and clean a list of Disney Wikipedia pages to build a dataset for further analysis. </p>

### Get Info Box (Store in Python dictionary)

#### Import Necessary Libraries

In [2]:
from bs4 import BeautifulSoup as bs
import requests

#### Load the webpage

In [3]:
r = requests.get("https://en.wikipedia.org/wiki/Toy_Story_3")

# Convert to a BeautifulSoup object (naming the parser avoids a warning)
soup = bs(r.content, "html.parser")

# Keep a pretty-printed copy of the HTML for inspection
contents = soup.prettify()

In [4]:
info_box = soup.find(class_="infobox vevent")
info_rows = info_box.find_all("tr")




In [5]:
def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

movie_info = {}

for index, row in enumerate(info_rows):
    if index == 0:
        movie_info['title'] = row.find("th").get_text(" ", strip=True)
    elif index == 1:
        continue
    else:
        content_key = row.find("th").get_text(" ", strip=True)
        content_value = get_content_value(row.find("td"))
        movie_info[content_key] = content_value
        


### Get info box for all movies

In [6]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")

# Convert to a BeautifulSoup object (naming the parser avoids a warning)
soup = bs(r.content, "html.parser")

# Keep a pretty-printed copy of the HTML for inspection
contents = soup.prettify()

In [7]:
movies = soup.select(".wikitable.sortable i")

movies[0]

<i><a href="/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)" title="Snow White and the Seven Dwarfs (1937 film)">Snow White and the Seven Dwarfs</a></i>

In [8]:
def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    elif row_data.find("br"):
        return [text for text in row_data.stripped_strings]
        
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

def clean_tags(soup):
    for tag in soup.find_all(["sup", "span"]):
        tag.decompose() 
    
    
def get_info_box(url):
    
    r = requests.get(url)
    soup = bs(r.content, "html.parser")

    # Strip <sup> reference markers and <span> tags before reading any text
    clean_tags(soup)

    info_box = soup.find(class_="infobox vevent")
    info_rows = info_box.find_all("tr")
    
    movie_info = {}

    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info['title'] = row.find("th").get_text(" ", strip=True)
        else:
            header = row.find('th')
            if header:
                content_key = row.find("th").get_text(" ", strip=True)
                content_value = get_content_value(row.find("td"))
                movie_info[content_key] = content_value
            
    return movie_info


In [9]:
get_info_box("https://en.wikipedia.org/wiki/One_Little_Indian_(film)")

{'title': 'One Little Indian',
 'Directed by': 'Bernard McEveety',
 'Written by': 'Harry Spalding',
 'Produced by': 'Winston Hibler',
 'Starring': ['James Garner',
  'Vera Miles',
  'Pat Hingle',
  "Clay O'Brien",
  'John Doucette',
  'Morgan Woodward',
  'Andrew Prine'],
 'Cinematography': 'Charles F. Wheeler',
 'Edited by': 'Robert Stafford',
 'Music by': 'Jerry Goldsmith',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Distribution',
 'Release date': ['June 20, 1973'],
 'Running time': '90 Minutes',
 'Country': 'United States',
 'Language': 'English',
 'Box office': '$2 million'}

In [10]:
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")
soup = bs(r.content)
movies = soup.select(".wikitable.sortable i a")

# hrefs are site-relative ("/wiki/..."), so the base must not end with a slash
base_path = "https://en.wikipedia.org"

movie_info_list = []
for movie in movies:
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path

        movie_info_list.append(get_info_box(full_path))

    except Exception as e:
        # Report pages whose infobox is missing or shaped differently, then move on
        print(movie.get_text())
        print(e)

Zorro the Avenger
'NoneType' object has no attribute 'find'
The Sign of Zorro
'NoneType' object has no attribute 'find'
Mighty Ducks the Movie: The First Face-Off
'NoneType' object has no attribute 'find'
Spirited Away
'NoneType' object has no attribute 'find'
Howl's Moving Castle
'NoneType' object has no attribute 'find'
Ponyo
'NoneType' object has no attribute 'find'
Tales from Earthsea
'NoneType' object has no attribute 'find'
The Secret World of Arrietty
'NoneType' object has no attribute 'find'
The Beatles: Get Back – The Rooftop Concert
'NoneType' object has no attribute 'find'
Zombies 3
'NoneType' object has no attribute 'find'
Elio
'NoneType' object has no attribute 'find_all'
61
'NoneType' object has no attribute 'find_all'
All Night Long
'NoneType' object has no attribute 'find'
Big Thunder Mountain Railroad
'NoneType' object has no attribute 'find_all'
Keeper of the Lost Cities
'NoneType' object has no attribute 'find_all'
Muppet Man
'NoneType' object has no attribute 'find_

In [11]:
len(movie_info_list)

543
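The loop above builds each page URL by string concatenation, which depends on where the slashes fall in the base and in the scraped `href`. As a minimal sketch of an alternative, the standard library's `urllib.parse.urljoin` resolves a relative path against a base. The helper name `absolute_url` and the constant `BASE_URL` are illustrative and not part of the original notebook.

```python
from urllib.parse import urljoin

# Site root; hrefs scraped from the list page are site-relative,
# for example "/wiki/Toy_Story_3".
BASE_URL = "https://en.wikipedia.org/"

def absolute_url(href, base=BASE_URL):
    """Resolve a scraped href against the site root without doubling slashes."""
    return urljoin(base, href)

print(absolute_url("/wiki/Toy_Story_3"))
```

An `href` that is already absolute passes through unchanged, so the same helper can be applied to every link.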

#### Save / Reload Movie Data

In [12]:
import json

def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

In [13]:
import json

def load_data(title):
    with open(title, encoding="utf-8") as f:
        return json.load(f)

In [14]:
save_data("disney_data_cleaned.json", movie_info_list)
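To check that a save and reload pair like the one above round-trips cleanly, a small self-contained sketch can write a record to a temporary file and read it back. The sample record is hypothetical; the accented character is there to show that `ensure_ascii=False` together with `encoding="utf-8"` keeps non-ASCII text readable in the file.

```python
import json
import os
import tempfile

# Hypothetical record, not taken from the scraped data.
sample = [{"title": "Example Film", "Starring": ["Zoë Example", "Sam Sample"]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")

    # Same options as save_data: UTF-8 file, no ASCII escaping, 2-space indent
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)

    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

reloaded = json.loads(raw_text)

print(reloaded == sample)   # the data survives the round trip
print("Zoë" in raw_text)    # the name is stored as-is, not as a \u escape
```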

### Task #3: Clean our data!

In [15]:
movie_info_list = load_data("disney_data_cleaned.json")

#### Subtasks
- ~Clean up references [1]~
- ~Convert running time into an integer~
- Convert dates into datetime object
- ~Split up the long strings~
- ~Convert Budget & Box office to numbers~

In [16]:
[movie.get('Running time', 'N/A') for movie in movie_info_list]

['83 minutes',
 '88 minutes',
 '126 minutes',
 '74 minutes',
 '64 minutes',
 '70 minutes',
 '42 minutes',
 '65 min',
 '71 minutes',
 '75 minutes',
 '94 minutes',
 '73 minutes',
 '75 minutes',
 '82 minutes',
 '68 minutes',
 '74 minutes',
 '96 minutes',
 '75 minutes',
 '84 minutes',
 '77 minutes',
 '92 minutes',
 '69 minutes',
 '81 minutes',
 ['60 minutes (VHS and Wild Discovery version)', '71 minutes (original)'],
 '127 minutes',
 '93 minutes',
 '76 minutes',
 '75 minutes',
 '73 minutes',
 '85 minutes',
 '81 minutes',
 '70 minutes',
 '90 min.',
 '80 minutes',
 '75 minutes',
 '84 minutes',
 '83 minutes',
 '72 minutes',
 '97 minutes',
 '75 minutes',
 '104 minutes',
 '93 minutes',
 '105 minutes',
 '95 minutes',
 '97 minutes',
 '134 minutes',
 '69 minutes',
 '92 minutes',
 '126 minutes',
 '79 minutes',
 '97 minutes',
 '128 minutes',
 '73 minutes',
 '91 minutes',
 '105 minutes',
 '98 minutes',
 '130 minutes',
 '89 minutes',
 '93 minutes',
 '67 minutes',
 '98 minutes',
 '100 minutes',
 '118 m

In [17]:
def minute_to_integer(running_time):
    if running_time == "N/A":
        return None
    if isinstance(running_time, list):
        entry = running_time[0]
        return int(entry.split(" ")[0])
        
    else:
        return int(running_time.split(" ")[0])

for movie in movie_info_list:
    movie['Running time (int)'] = minute_to_integer(movie.get('Running time', "N/A"))
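The running-time values listed above are not uniform: most read `'83 minutes'`, but some read `'65 min'` or `'90 min.'`, and one is a list of alternative cuts. A hedged variant of the conversion pulls out the first run of digits with a regular expression rather than relying on a space-separated layout. The name `parse_minutes` is illustrative only.

```python
import re

def parse_minutes(running_time):
    """Return the first integer found in a running-time field, or None."""
    if isinstance(running_time, list):
        # Lists hold alternative cuts; this sketch takes the first entry.
        running_time = running_time[0] if running_time else ""
    if not isinstance(running_time, str):
        return None
    match = re.search(r"\d+", running_time)
    return int(match.group()) if match else None

print(parse_minutes("90 min."))
print(parse_minutes(["60 minutes (VHS and Wild Discovery version)",
                     "71 minutes (original)"]))
print(parse_minutes("N/A"))
```

Taking the first list entry mirrors the choice made in `minute_to_integer`; whether the first or the original cut is the better value is a judgment call for the analysis.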
        


In [18]:
movie_info_list[-10]

{'title': 'Robin Hood',
 'Directed by': 'Wolfgang Reitherman',
 'Story by': ['Larry Clemmons',
  'Ken Anderson',
  'Vance Gerry',
  'Frank Thomas',
  'Eric Cleworth',
  'Julius Svendsen',
  'David Michener'],
 'Based on': 'The legend of Robin Hood',
 'Produced by': 'Wolfgang Reitherman',
 'Starring': ['Peter Ustinov',
  'Phil Harris',
  'Brian Bedford',
  'Roger Miller',
  'Pat Buttram',
  'George Lindsey',
  'Andy Devine'],
 'Edited by': ['Tom Acosta', 'Jim Melton'],
 'Music by': 'George Bruns',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Distribution',
 'Release date': ['November 8, 1973'],
 'Running time': '83 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$5 million',
 'Box office': '$33 million',
 'Running time (int)': 83}

In [19]:
print([movie.get('Running time (int)', 'N/A') for movie in movie_info_list])

[83, 88, 126, 74, 64, 70, 42, 65, 71, 75, 94, 73, 75, 82, 68, 74, 96, 75, 84, 77, 92, 69, 81, 60, 127, 93, 76, 75, 73, 85, 81, 70, 90, 80, 75, 84, 83, 72, 97, 75, 104, 93, 105, 95, 97, 134, 69, 92, 126, 79, 97, 128, 73, 91, 105, 98, 130, 89, 93, 67, 98, 100, 118, 103, 110, 80, 79, 91, 91, 97, 118, 139, 131, 92, 87, 116, 93, 114, 110, 131, 101, 110, 84, 78, 75, 164, 106, 110, 99, 113, 108, 102, 85, 91, 93, 100, 100, 79, 96, 113, 89, 118, 92, 88, 92, 87, 93, 93, 93, 90, 83, 96, 88, 89, 91, 93, 92, 97, 100, 100, 89, None, 91, 112, 115, 95, 91, 97, 104, 74, 48, 77, 104, 128, 101, 94, 104, 90, 100, 88, 93, 98, 112, 84, 97, 97, 114, 96, 97, 109, 83, 90, 107, 96, 103, 91, 95, 105, 113, 80, 101, 90, 74, 90, 89, 110, 74, 93, 84, 83, 74, 77, 107, 93, 88, 108, 84, 121, 89, 104, 90, 86, 84, 108, 107, 96, 98, 105, 108, 94, 106, 102, 69, 88, 102, 102, 97, 111, 92, 100, 96, 96, 78, 81, 108, 89, 99, 89, 81, 92, 100, 89, 79, 91, 81, 101, 104, 103, 86, 106, 74, 93, 92, 98, 76, 95, 72, 93, 87, 70, 93, 87

In [20]:
print([movie.get('Budget', 'N/A') for movie in movie_info_list])

['$1.5 million', '$2.6 million', '$2.28 million', '$600,000', '$950,000', '$858,000', 'N/A', '$788,000', 'N/A', '$1.35 million', '$2.125 million', 'N/A', '$1.5 million', '$1.5 million', 'N/A', '$2.2 million', '$1.8 million', '$3 million', 'N/A', '$4 million', '$2 million', '$300,000', '$1.8 million', 'N/A', '$5 million', 'N/A', '$4 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$700,000', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$6 million', 'under $1 million or $1,250,000', 'N/A', '$2 million', 'N/A', 'N/A', '$2.5 million', 'N/A', 'N/A', '$4 million', '$3.6 million', 'N/A', 'N/A', 'N/A', 'N/A', '$3 million', 'N/A', '$3 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$3 million', 'N/A', 'N/A', 'N/A', 'N/A', '$4.4–6 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$4 million', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$4 million', 'N/A', 'N/A', 'N/A', '$6.3 m

In [21]:
import re

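amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"

# Range separators: hyphen, " to ", en dash (\u2013), or em dash (\u2014).
# The scraped budgets use an en dash, as in '$4.4–6 million'.
word_re = rf"\${number}(-|\sto\s|\u2013|\u2014)?({number})?\s({amounts})"
value_re = rf"\${number}"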

def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

def parse_word_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",",""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value

def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",",""))
    return value

# Expected behavior:
# money_conversion("$12.2 million") --> 12200000   (word syntax)
# money_conversion("$790,000")      --> 790000     (value syntax)


def money_conversion(money):
    
    if money == 'N/A':
        return None

    if isinstance(money,list):
        money = money[0]

    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)

    if word_syntax:
        return parse_word_syntax(word_syntax.group())

    elif value_syntax:
        return parse_value_syntax(value_syntax.group())
    
    else:
        return None

print(money_conversion("$6.5 billion"))

6500000000.0
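The budget list above includes a range written as `'$4.4–6 million'`, and `money_conversion` keeps only the first number of such a value. As a sketch under stated assumptions, and separate from the functions above, a single pattern with named groups can return both ends of a range. The names `NUM`, `MONEY_RE`, `MULTIPLIERS`, and `parse_money_range` are hypothetical, and the set of accepted separators (hyphen, en dash, em dash, "to") is an assumption about what the infoboxes contain.

```python
import re

MULTIPLIERS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

# One number such as "1,250,000" or "4.4".
NUM = r"\d+(?:,\d{3})*(?:\.\d+)?"

# "$<low>", an optional "<separator><high>", and an optional scale word.
# \u2013 is an en dash and \u2014 is an em dash.
MONEY_RE = re.compile(
    rf"\$(?P<low>{NUM})"
    rf"(?:\s*(?:-|\u2013|\u2014|to)\s*\$?(?P<high>{NUM}))?"
    rf"(?:\s*(?P<scale>thousand|million|billion))?",
    flags=re.IGNORECASE,
)

def parse_money_range(text):
    """Return (low, high) in dollars, or None if no dollar amount is found."""
    match = MONEY_RE.search(text)
    if not match:
        return None
    scale_word = match.group("scale")
    scale = MULTIPLIERS[scale_word.lower()] if scale_word else 1
    low = float(match.group("low").replace(",", "")) * scale
    high_text = match.group("high")
    # A single value is reported as a range whose ends coincide.
    high = float(high_text.replace(",", "")) * scale if high_text else low
    return low, high

print(parse_money_range("$4.4\u20136 million"))
print(parse_money_range("$790,000"))
print(parse_money_range("N/A"))
```

Values such as `'$18 million or $25 million'` are not treated as ranges here; only the first amount is read, which matches what `money_conversion` does.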


In [22]:
for movie in movie_info_list:
    movie['Budget (float)'] = money_conversion(movie.get('Budget', "N/A"))
    movie['Box office (float)'] = money_conversion(movie.get('Box office', "N/A"))

In [23]:
money_conversion(str(movie_info_list[-10]['Budget']))

5000000.0

In [24]:
# convert dates into datetime
print([movie.get('Release date', 'N/A') for movie in movie_info_list])

['N/A', 'N/A', ['November 13, 1940'], ['June 27, 1941'], 'N/A', 'N/A', 'N/A', ['July 17, 1943'], 'N/A', 'N/A', 'N/A', ['September 27, 1947'], 'May 27, 1948', 'N/A', ['October 5, 1949'], 'N/A', 'N/A', 'N/A', 'N/A', ['February 5, 1953'], ['July 23, 1953 (United States)'], ['November 10, 1953'], 'N/A', ['August 17, 1954'], ['December 23, 1954'], 'May 25, 1955', ['June 22, 1955'], ['September 14, 1955'], 'December 22, 1955', 'June 8, 1956', ['July 18, 1956'], ['September 4, 1956'], ['December 20, 1956'], 'June 19, 1957', 'August 28, 1957', ['December 25, 1957'], ['July 8, 1958'], ['August 12, 1958'], ['December 25, 1958'], ['January 29, 1959'], ['March 19, 1959'], 'N/A', ['November 10, 1959'], 'January 21, 1960 ( Sarasota, FL )', ['February 24, 1960'], 'May 19, 1960', 'N/A', ['November 1, 1960'], ['December 21, 1960'], ['January 25, 1961'], 'March 16, 1961', ['June 21, 1961'], ['July 12, 1961'], ['July 17, 1961'], ['December 14, 1961'], 'April 5, 1962', ['May 17, 1962'], ['June 6, 1962'], 

In [25]:
# June 28, 1960
from datetime import datetime

dates = [movie.get('Release date', 'N/A') for movie in movie_info_list]

def clean_date(date):
    return date.split('(')[0].strip()

def date_conversion(date):
    if isinstance(date, list):
        date = date[0]
        
    if date == 'N/A':
        return None
        
    date_str = clean_date(date)
    fmts = ["%B %d, %Y", "%d %B %Y", "%Y"]
    for fmt in fmts:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            pass
    
    return None     

In [26]:
for movie in movie_info_list:
    movie['Release date (datetime)'] = date_conversion(movie.get('Release date', 'N/A'))
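Some infoboxes store their dates under the plural key `'Release dates'` (the Bambi record shown later is one example), so a lookup on `'Release date'` alone returns `'N/A'` for them. The following is a minimal sketch, not the notebook's own method, of a lookup that checks both spellings before parsing. The names `parse_release_date` and `first_release_date` are illustrative.

```python
from datetime import datetime

DATE_FORMATS = ["%B %d, %Y", "%d %B %Y", "%Y"]

def parse_release_date(raw):
    """Parse one infobox date string, ignoring any trailing "(...)" note."""
    text = raw.split("(")[0].strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

def first_release_date(movie):
    """Check both key spellings and parse the first listed date."""
    raw = movie.get("Release date", movie.get("Release dates"))
    if raw is None:
        return None
    if isinstance(raw, list):
        if not raw:
            return None
        raw = raw[0]
    return parse_release_date(raw)

bambi_like = {"Release dates": ["August 9, 1942 ( London )",
                                "August 21, 1942 (United States)"]}
print(first_release_date(bambi_like))
```

Taking the first entry means a premiere date wins over a later general release; picking a specific country's date would need extra logic.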

In [27]:
movie_info_list[-9]

{'title': 'SpaceCamp',
 'Directed by': 'Harry Winer',
 'Screenplay by': ['Clifford Green', 'Casey T. Mitchell'],
 'Story by': ['Patrick Bailey', 'Larry B. Williams'],
 'Produced by': ['Patrick Bailey', 'Walter Coblenz'],
 'Starring': ['Kate Capshaw',
  'Lea Thompson',
  'Kelly Preston',
  'Larry B. Scott',
  'Leaf Phoenix',
  'Tate Donovan',
  'Tom Skerritt'],
 'Cinematography': 'William A. Fraker',
 'Edited by': ['Tim Board', 'John W. Wheeler'],
 'Music by': 'John Williams',
 'Production company': 'ABC Motion Pictures',
 'Distributed by': '20th Century Fox',
 'Release date': ['June 6, 1986 (United States)'],
 'Running time': '107 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$18 million or $25 million',
 'Box office': '$9,697,739 (USA)',
 'Running time (int)': 107,
 'Budget (float)': 18000000.0,
 'Box office (float)': 9697739.0,
 'Release date (datetime)': datetime.datetime(1986, 6, 6, 0, 0)}

In [28]:
import pickle

def save_data_pickle(name, data):
    with open(name,'wb') as f:
        pickle.dump(data, f)

In [29]:
import pickle

def load_data_pickle(name):
    with open(name,'rb') as f:
        return pickle.load(f)

In [30]:
save_data_pickle("disney_movie_data_cleaned_more.pickle", movie_info_list)

In [31]:
a = load_data_pickle("disney_movie_data_cleaned_more.pickle")
a[5]

{'title': 'Bambi',
 'Directed by': ['Supervising director',
  'David Hand',
  'Sequence directors',
  'James Algar',
  'Samuel Armstrong',
  'Graham Heid',
  'Bill Roberts',
  'Paul Satterfield',
  'Norman Wright'],
 'Story by': ['Story direction',
  'Perce Pearce',
  'Story adaptation',
  'Larry Morey',
  'Story development',
  'Vernon Stallings',
  'Melvin Shaw',
  'Carl Fallberg',
  'Chuck Couch',
  'Ralph Wright'],
 'Based on': ['Bambi, a Life in the Woods', 'by', 'Felix Salten'],
 'Produced by': 'Walt Disney',
 'Starring': 'see below',
 'Music by': ['Frank Churchill', 'Edward H. Plumb'],
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'RKO Radio Pictures',
 'Release dates': ['August 9, 1942 ( London )',
  'August 21, 1942 (United States)'],
 'Running time': '70 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$858,000',
 'Box office': '$267.4 million',
 'Running time (int)': 70,
 'Budget (float)': 858000.0,
 'Box office (float)': 2673

### Task #4: Attach IMDb/Rotten Tomatoes/Metascore scores

In [32]:
# API link 
# http://www.omdbapi.com/?apikey=[yourkey]&

In [33]:
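import os
import urllib.parse

import requests

def get_omdb_info(title):
    base_url = "http://www.omdbapi.com/?"
    parameters = {
        # Read the key from the environment instead of hard-coding it.
        # The variable name OMDB_API_KEY is an assumption; set it before running.
        "apikey": os.environ["OMDB_API_KEY"],
        "t": title,
    }
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()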

def get_rotten_tomato_score(omdb_info):
    ratings = omdb_info.get('Ratings', [])
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None
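To see what the request URL looks like without calling the API, the query string can be built on its own. This sketch assumes only the two parameters used above (`apikey` and `t`). The helper name `build_omdb_url`, the environment variable `OMDB_API_KEY`, and the value `"demo-key"` are placeholders.

```python
import os
from urllib.parse import urlencode

def build_omdb_url(title, api_key):
    """Build the title-lookup URL without sending a request."""
    return "http://www.omdbapi.com/?" + urlencode({"apikey": api_key, "t": title})

# Hypothetical variable name; falls back to a placeholder when unset.
api_key = os.environ.get("OMDB_API_KEY", "demo-key")

# urlencode escapes the apostrophe and turns spaces into "+"
print(build_omdb_url("Howl's Moving Castle", "demo-key"))
```

Titles with punctuation are a common source of failed lookups, so inspecting the encoded URL is a quick first check.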


In [34]:
for movie in movie_info_list:
    title = movie['title']
    omdb_info = get_omdb_info(title)
    movie['imdb'] = omdb_info.get('imdbRating', None)
    movie['metascore'] = omdb_info.get('Metascore', None)
    movie['rotten_tomatoes'] = get_rotten_tomato_score(omdb_info)
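The three scores are stored as strings, for example `'6.6'`, `'53'`, and `'43%'`, and a missing value may appear as `'N/A'` or `None`. If numeric columns are wanted for later analysis, a small converter along these lines could be applied. It is a sketch and the name `score_to_float` is hypothetical.

```python
def score_to_float(value):
    """Convert '6.6', '53', or '43%' to a float; anything unparseable becomes None."""
    if value is None:
        return None
    text = str(value).strip().rstrip("%")
    try:
        return float(text)
    except ValueError:
        return None

for raw in ["6.6", "53", "43%", "N/A", None]:
    print(repr(raw), "->", score_to_float(raw))
```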

In [35]:
movie_info_list[-10]

{'title': 'Robin Hood',
 'Directed by': 'Wolfgang Reitherman',
 'Story by': ['Larry Clemmons',
  'Ken Anderson',
  'Vance Gerry',
  'Frank Thomas',
  'Eric Cleworth',
  'Julius Svendsen',
  'David Michener'],
 'Based on': 'The legend of Robin Hood',
 'Produced by': 'Wolfgang Reitherman',
 'Starring': ['Peter Ustinov',
  'Phil Harris',
  'Brian Bedford',
  'Roger Miller',
  'Pat Buttram',
  'George Lindsey',
  'Andy Devine'],
 'Edited by': ['Tom Acosta', 'Jim Melton'],
 'Music by': 'George Bruns',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Distribution',
 'Release date': ['November 8, 1973'],
 'Running time': '83 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$5 million',
 'Box office': '$33 million',
 'Running time (int)': 83,
 'Budget (float)': 5000000.0,
 'Box office (float)': 33000000.0,
 'Release date (datetime)': datetime.datetime(1973, 11, 8, 0, 0),
 'imdb': '6.6',
 'metascore': '53',
 'rotten_tomatoes': '43%'}

In [36]:
save_data_pickle('disney_movie_data_final.pickle', movie_info_list)

### Task #5: Save Data as JSON & CSV

In [37]:
movie_info_list[30]

{'title': 'Davy Crockett and the River Pirates',
 'Directed by': 'Norman Foster',
 'Written by': ['Tom Blackburn', 'Norman Foster'],
 'Produced by': 'Bill Walsh',
 'Starring': ['Fess Parker', 'Buddy Ebsen', 'Jeff York'],
 'Cinematography': 'Bert Glennon',
 'Edited by': 'Stanley Johnson',
 'Music by': ['Thomas W. Blackburn (lyrics)',
  'George Bruns',
  'Edward H. Plumb (orchestration)'],
 'Color process': 'Technicolor',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Film Distribution Co., Inc.',
 'Release date': ['July 18, 1956'],
 'Running time': '81 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Running time (int)': 81,
 'Budget (float)': None,
 'Box office (float)': None,
 'Release date (datetime)': datetime.datetime(1956, 7, 18, 0, 0),
 'imdb': '6.6',
 'metascore': 'N/A',
 'rotten_tomatoes': None}

In [38]:
movie_info_copy = [movie.copy() for movie in movie_info_list]

In [39]:
for movie in movie_info_copy:
    current_date = movie['Release date (datetime)']
    if current_date:
        movie['Release date (datetime)'] = current_date.strftime("%B %d, %Y")
    else:
        movie['Release date (datetime)'] = None
    

In [40]:
movie_info_list[30]

{'title': 'Davy Crockett and the River Pirates',
 'Directed by': 'Norman Foster',
 'Written by': ['Tom Blackburn', 'Norman Foster'],
 'Produced by': 'Bill Walsh',
 'Starring': ['Fess Parker', 'Buddy Ebsen', 'Jeff York'],
 'Cinematography': 'Bert Glennon',
 'Edited by': 'Stanley Johnson',
 'Music by': ['Thomas W. Blackburn (lyrics)',
  'George Bruns',
  'Edward H. Plumb (orchestration)'],
 'Color process': 'Technicolor',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'Buena Vista Film Distribution Co., Inc.',
 'Release date': ['July 18, 1956'],
 'Running time': '81 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Running time (int)': 81,
 'Budget (float)': None,
 'Box office (float)': None,
 'Release date (datetime)': datetime.datetime(1956, 7, 18, 0, 0),
 'imdb': '6.6',
 'metascore': 'N/A',
 'rotten_tomatoes': None}

In [41]:
save_data("disney_data_final.json", movie_info_copy)
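The copy step above exists because `json` cannot serialize `datetime` objects directly. Another option, sketched here under the assumption that ISO 8601 strings are acceptable in the output file, is to pass a `default` hook to `json.dumps`, which is called only for values the encoder cannot handle itself. The function name `encode_datetime` is illustrative.

```python
import json
from datetime import datetime

def encode_datetime(value):
    """Hook for json.dumps: turn datetimes into ISO 8601 strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")

record = {"title": "Robin Hood", "Release date (datetime)": datetime(1973, 11, 8)}

encoded = json.dumps(record, default=encode_datetime)
decoded = json.loads(encoded)

# ISO strings can be turned back into datetime objects after loading
restored = datetime.fromisoformat(decoded["Release date (datetime)"])

print(encoded)
print(restored == record["Release date (datetime)"])
```

ISO strings sort chronologically and parse without a format string, while the `"%B %d, %Y"` form used above is easier to read by eye.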

#### Convert data to CSV

In [42]:
import pandas as pd

df = pd.DataFrame(movie_info_list)

In [43]:
df.head()

|  | title | Directed by | Story by | Based on | Produced by | Starring | Music by | Production company | Distributed by | Release dates | ... | Traditional Chinese | Simplified Chinese | Original title | Layouts by | Music | Lyrics | Book | Basis | Productions | Awards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Snow White and the Seven Dwarfs | [David Hand, Perce Pearce, William Cottrell, L... | [Ted Sears, Richard Creedon, Otto Englander, D... | [Snow White, by the, Brothers Grimm] | Walt Disney | [Adriana Caselotti, Roy Atwell, Pinto Colvig, ... | [Frank Churchill, Leigh Harline, Paul Smith] | Walt Disney Productions | RKO Radio Pictures | [December 21, 1937 ( Carthay Circle Theatre ),... | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | Pinocchio | [Ben Sharpsteen, Hamilton Luske, Bill Roberts,... | [Ted Sears, Otto Englander, Webb Smith, Willia... | [The Adventures of Pinocchio, by, Carlo Collodi] | Walt Disney | [Cliff Edwards, Dickie Jones, Christian Rub, W... | [Leigh Harline, Paul J. Smith] | Walt Disney Productions | RKO Radio Pictures | [February 7, 1940 ( Center Theatre ), February... | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | Fantasia | [Samuel Armstrong, James Algar, Bill Roberts, ... | [Joe Grant, Dick Huemer] |  | [Walt Disney, Ben Sharpsteen] | [Leopold Stokowski, Deems Taylor] | See program | Walt Disney Productions | RKO Radio Pictures |  | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | The Reluctant Dragon | [Alfred Werker, (live action), Hamilton Luske,... |  |  | Walt Disney | [Robert Benchley, Frances Gifford, Buddy Peppe... | [Frank Churchill, Larry Morey] | Walt Disney Productions | RKO Radio Pictures |  | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | Dumbo | [Ben Sharpsteen, Norman Ferguson, Wilfred Jack... | [Joe Grant, Dick Huemer] | [Dumbo, the Flying Elephant, by, Helen Aberson... | Walt Disney | [Edward Brophy, Verna Felton, Cliff Edwards, H... | [Frank Churchill, Oliver Wallace] | Walt Disney Productions | RKO Radio Pictures | [October 23, 1941 (New York City), October 31,... | ... |  |  |  |  |  |  |  |  |  |  |


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 543 entries, 0 to 542
Data columns (total 43 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   title                    543 non-null    object        
 1   Directed by              538 non-null    object        
 2   Story by                 179 non-null    object        
 3   Based on                 298 non-null    object        
 4   Produced by              529 non-null    object        
 5   Starring                 505 non-null    object        
 6   Music by                 536 non-null    object        
 7   Production company       211 non-null    object        
 8   Distributed by           541 non-null    object        
 9   Release dates            202 non-null    object        
 10  Running time             529 non-null    object        
 11  Country                  474 non-null    object        
 12  Language                 518 non-nul

In [45]:
df.to_csv('disney_movie_data_final.csv')
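Several fields hold Python lists, and a CSV export stores such a cell as the list's printed form (for example `['Tom Acosta', 'Jim Melton']`), which has to be parsed again on reload. One possible pre-processing step, sketched here with the standard `csv` module rather than pandas so it runs on its own, is to join each list into a single delimited string first. The helper `flatten_record` and the `"; "` separator are assumptions; the two sample records reuse values shown earlier in this notebook.

```python
import csv
import io

def flatten_record(record, separator="; "):
    """Join list values into one string so every CSV cell holds plain text."""
    return {
        key: separator.join(str(item) for item in value) if isinstance(value, list) else value
        for key, value in record.items()
    }

records = [
    {"title": "Robin Hood", "Edited by": ["Tom Acosta", "Jim Melton"], "Music by": "George Bruns"},
    {"title": "One Little Indian", "Edited by": "Robert Stafford", "Music by": "Jerry Goldsmith"},
]

flat = [flatten_record(record) for record in records]

# Collect column names in first-seen order across all records
fieldnames = list(dict.fromkeys(key for row in flat for key in row))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(flat)

print(buffer.getvalue())
```

The same `flatten_record` output can be passed to `pd.DataFrame` before `to_csv`; the separator should be a string that does not occur inside the values themselves.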