In [2]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

from pathlib import Path

from datetime import datetime
from sklearn.model_selection import train_test_split

import warnings # necessary b/c pandas & statsmodels datetime issue
warnings.simplefilter(action="ignore")

# Import Data
#### First we import a few different lists of Disney films.  We have lists from 
 - Wikipedia 
 - D23 (The official Disney Fan Club) 
 - Disney.com
<br>

#### The lists all contain different titles and film data. We'll attempt to create a single master list of Disney films to work from.

### Import Disney.com data (All Films)

In [3]:
FL_Disney = pd.read_csv('DD_Films_List_Disney_Com/Film_list_Disney_com.csv')
FL_Disney.rename(columns={'Movie Title': 'title'}, inplace=True)

print(FL_Disney.shape)
FL_Disney.head()

(657, 1)


Unnamed: 0,title
0,101 Dalmatians
1,101 Dalmatians (1996)
2,101 Dalmatians II: Patch's London Adventure
3,102 Dalmatians
4,"20,000 Leagues Under the Sea"


In some cases, the remake of a film includes the year of release in the title. In other cases, the original release includes the year in the title. This may not be a big deal as the IMDB

In some cases, parentheses do not indicate the year of release, e.g. Frozen (Sing-Along Edition) or The Wizards Return: Alex vs. Alex (TV Special).

In [4]:
# Raw string so that "\(" reaches the regex engine as an escaped parenthesis
FL_Disney[FL_Disney['title'].str.contains(r"\(")]

Unnamed: 0,title
1,101 Dalmatians (1996)
35,Annie (1999)
90,Cinderella (1950)
169,Freaky Friday (1976)
171,Frozen (Sing-Along Edition)
348,Pete's Dragon (2016)
420,Sleeping Beauty (1959)
516,The Jungle Book (1967)
517,The Jungle Book (1994)
518,The Jungle Book (2016)
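
The output above mixes two kinds of parentheses: a trailing four-digit year and other qualifiers such as "(Sing-Along Edition)". A minimal sketch of one way to tell them apart is shown here. The helper name `split_title_year` and the regular expression are illustrative assumptions, not part of the original notebook.

```python
import re

# Matches a title that ends with a four-digit year in parentheses, e.g. "Annie (1999)".
YEAR_SUFFIX = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")


def split_title_year(raw_title):
    """Return (title, year); year is None when the title has no trailing (YYYY)."""
    match = YEAR_SUFFIX.match(raw_title)
    if match is None:
        # Parentheses that are not a year (or no parentheses at all) are left untouched.
        return raw_title.strip(), None
    return match.group("title"), int(match.group("year"))


print(split_title_year("101 Dalmatians (1996)"))
print(split_title_year("Frozen (Sing-Along Edition)"))
```

If this approach fits, it could be applied to the whole column with something like `FL_Disney['title'].map(split_title_year)`.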


In some cases, the same or similar title is used multiple times. Since this list is from Disney.com, we will include all of these in our master list but may not include all of the films in our analysis.  Once we have more data, we will explore different groupings and determine which films or subset of films to work with.

In [5]:
FL_Disney[FL_Disney['title'].str.contains("101 Dalmatians")]

Unnamed: 0,title
0,101 Dalmatians
1,101 Dalmatians (1996)
2,101 Dalmatians II: Patch's London Adventure
412,Sing Along Songs: 101 Dalmatians -- Pongo & Pe...


The 1982 release of Annie was produced by Rastar and distributed by Columbia Pictures, not Disney. The 1999 version was a Disney production.

In [6]:
FL_Disney[FL_Disney['title'].str.contains("Annie")]

Unnamed: 0,title
35,Annie (1999)


Cinderella was remade in 2015 as a live-action film. It seems odd that (1950) would be used to denote the original Disney production but we're going to leave it for now.

In [8]:
FL_Disney[FL_Disney['title'].str.contains("Cinderella")]

Unnamed: 0,title
89,Cinderella
90,Cinderella (1950)
91,Cinderella II: Dreams Come True
92,Cinderella III: A Twist in Time
393,Rodgers & Hammerstein's Cinderella


Freaky Friday was released in 1976, 2003, and 2018.  The 2018 release was a TV movie, not a theatrical release.

In [9]:
FL_Disney[FL_Disney['title'].str.contains("Freaky Friday")]

Unnamed: 0,title
167,Freaky Friday
168,Freaky Friday
169,Freaky Friday (1976)
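
Two of the rows above carry exactly the same title, so the title alone cannot serve as a unique key. A small sketch for listing such exact repeats follows. The function name `duplicated_titles` is an assumption made for illustration.

```python
from collections import Counter


def duplicated_titles(titles):
    """Return a dict of title -> count for titles that appear more than once."""
    counts = Counter(title.strip() for title in titles)
    return {title: n for title, n in counts.items() if n > 1}


# The three rows shown in the output above.
sample = ["Freaky Friday", "Freaky Friday", "Freaky Friday (1976)"]
print(duplicated_titles(sample))
```

In pandas, `FL_Disney[FL_Disney['title'].duplicated(keep=False)]` would give a comparable view directly on the DataFrame.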


In [10]:
FL_Disney[FL_Disney['title'].str.contains("Pete's Dragon")]

Unnamed: 0,title
347,Pete's Dragon
348,Pete's Dragon (2016)


In [11]:
FL_Disney[FL_Disney['title'].str.contains("Sleeping Beauty")]

Unnamed: 0,title
420,Sleeping Beauty (1959)
630,Waking Sleeping Beauty


When spot checking some titles, I noticed that "Sing Along Songs: The Jungle Book -- The Bare ..." is in IMDB as "Disney Sing-Along-Songs: The Bare Necessities". Furthermore, when looking at IMDB, many of the "Disney Sing-Along-Songs" titles do not contain the name of the film. Creating a script to pick up the nuances may be time consuming with too little reward. <mark>Sorting out the Sing-Along titles may be better done manually for the sake of time.</mark>

In [12]:
FL_Disney[FL_Disney['title'].str.contains("Jungle Book")]

Unnamed: 0,title
417,Sing Along Songs: The Jungle Book -- The Bare ...
516,The Jungle Book (1967)
517,The Jungle Book (1994)
518,The Jungle Book (2016)
519,The Jungle Book 2
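
If the Sing-Along titles are handled by hand, it may help to set them aside first. The sketch below assumes they can be recognized by the "Sing Along Songs:" prefix seen in the rows above. The function name `split_for_manual_review` is hypothetical.

```python
SING_ALONG_PREFIX = "Sing Along Songs:"  # prefix as it appears in the Disney.com list


def split_for_manual_review(titles):
    """Return (automatic, manual); manual holds the Sing Along titles to match by hand."""
    automatic, manual = [], []
    for title in titles:
        target = manual if title.strip().startswith(SING_ALONG_PREFIX) else automatic
        target.append(title)
    return automatic, manual


automatic, manual = split_for_manual_review([
    "The Jungle Book (1967)",
    "Sing Along Songs: The Jungle Book -- The Bare ...",  # truncated as displayed above
    "The Jungle Book 2",
])
print(manual)
```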


In [13]:
FL_Disney[FL_Disney['title'].str.contains("Muppet Movie")]

Unnamed: 0,title
545,The Muppet Movie (1979)


In [14]:
FL_Disney[FL_Disney['title'].str.contains("Parent Trap")]

Unnamed: 0,title
555,The Parent Trap
556,The Parent Trap (1998)
557,The Parent Trap II


In [15]:
FL_Disney[FL_Disney['title'].str.contains("Shaggy Dog")]

Unnamed: 0,title
576,The Shaggy Dog
577,The Shaggy Dog (2006)


In [16]:
FL_Disney[FL_Disney['title'].str.contains("Winnie the Pooh")]

Unnamed: 0,title
537,The Many Adventures of Winnie the Pooh
646,Winnie the Pooh (2011)
647,Winnie the Pooh: A Very Merry Pooh Year
648,Winnie the Pooh: Springtime with Roo


### Import D23 data (All Films)

This one column holds the film title, year of release, and Motion Picture Association film rating. We will need to separate those before moving forward.  <br>
Also, the D23 list (Disney's Official Fan Club) has almost 100 more titles than the Disney.com list: 748 versus 657.  <mark> Will have to create a list of different titles and research further</mark>

In [17]:
FL_D23 = pd.read_csv('DD_Films_List_D23/D23_list.csv', header=None )
FL_D23.rename(columns={0: 'orig_title'}, inplace=True)
print(FL_D23.shape)
FL_D23.head(5)

(749, 1)


Unnamed: 0,orig_title
0,1. 1937: Snow White and the Seven Dwarfs (G)
1,2. 1940: Pinocchio (G)
2,3. 1940: Fantasia (G)
3,4. 1941: The Reluctant Dragon
4,5. 1941: Dumbo (G)


In [18]:
# Splitting out the D23 index number from the title
# Not dropping d23_index just yet in case I need it to troubleshoot

index_split = FL_D23['orig_title'].str.split(".", n = 1, expand = True)    
FL_D23['d23_index'] = index_split[0]
FL_D23['d23_title_str'] = index_split[1]



In [19]:
FL_D23.head(5)

Unnamed: 0,orig_title,d23_index,d23_title_str
0,1. 1937: Snow White and the Seven Dwarfs (G),1,1937: Snow White and the Seven Dwarfs (G)
1,2. 1940: Pinocchio (G),2,1940: Pinocchio (G)
2,3. 1940: Fantasia (G),3,1940: Fantasia (G)
3,4. 1941: The Reluctant Dragon,4,1941: The Reluctant Dragon
4,5. 1941: Dumbo (G),5,1941: Dumbo (G)


In [20]:
# Splitting out the year number from the title

year_split = FL_D23['d23_title_str'].str.split(":", n = 1, expand = True)    

# Strip the whitespace left over from the "." and ":" separators
FL_D23['d23_year'] = year_split[0].str.strip()
FL_D23['d23_title_str'] = year_split[1].str.strip()

FL_D23 = FL_D23[['orig_title', 'd23_index', 'd23_year', 'd23_title_str']]

In [21]:
FL_D23.head(5)

Unnamed: 0,orig_title,d23_index,d23_year,d23_title_str
0,1. 1937: Snow White and the Seven Dwarfs (G),1,1937,Snow White and the Seven Dwarfs (G)
1,2. 1940: Pinocchio (G),2,1940,Pinocchio (G)
2,3. 1940: Fantasia (G),3,1940,Fantasia (G)
3,4. 1941: The Reluctant Dragon,4,1941,The Reluctant Dragon
4,5. 1941: Dumbo (G),5,1941,Dumbo (G)


In [22]:
# Now that the year is separate, we have to split the title further.
# The problem is that some of the strings end with the rating, e.g. (G) or (PG-13),
# while others have no rating.
# Further complicating things, some of the title strings include the name of the production company in a set of parentheses.
# All four combinations occur: company with rating, company without rating,
# rating without company, and neither.
# After a few different attempts, the logic just got too messy. Going to use if/else (np.where) statements.
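
For comparison, the four cases described above can also be handled by peeling parenthesized groups off the end of the string one at a time. This is only a sketch of an alternative, not the approach used in the cells that follow. The names `parse_d23_entry`, `D23_ENTRY`, `TRAILING_PAREN`, and `RATINGS` are assumptions. It covers well-formed ratings only, so malformed entries such as "(PG_13)" or an unclosed "(PG-13" would still need special handling, and a title that itself ends in parentheses would be misread as a studio.

```python
import re

RATINGS = ("G", "PG", "PG-13", "R", "NR")

# "<index>. <year>: <everything else>"
D23_ENTRY = re.compile(r"^\s*(?P<index>\d+)\.\s*(?P<year>\d{4}):\s*(?P<rest>.*?)\s*$")
# A string whose last token is a parenthesized group without nested parentheses.
TRAILING_PAREN = re.compile(r"^(?P<head>.*?)\s*\((?P<inner>[^()]*)\)$")


def parse_d23_entry(entry):
    """Split one D23 list entry into index, year, title, studio, and rating."""
    match = D23_ENTRY.match(entry)
    if match is None:
        return None
    rest = match.group("rest")
    rating = studio = None

    tail = TRAILING_PAREN.match(rest)
    if tail and tail.group("inner") in RATINGS:
        # The last group is a rating: record it and look for a studio before it.
        rating, rest = tail.group("inner"), tail.group("head")
        tail = TRAILING_PAREN.match(rest)
    if tail:
        # Any remaining trailing group is assumed to be the studio.
        studio, rest = tail.group("inner"), tail.group("head")

    return {
        "index": int(match.group("index")),
        "year": int(match.group("year")),
        "title": rest,
        "studio": studio,
        "rating": rating,
    }


print(parse_d23_entry("716. 2016: Doctor Strange (Marvel) (PG-13)"))
print(parse_d23_entry("4. 1941: The Reluctant Dragon"))
```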

In [23]:
# If the rating is included in the title string, then assign the rating value to the rating column

title_str = FL_D23['d23_title_str']

# Patterns are raw strings so the regex escapes are not read as string escapes.
# Note that unrated and (NR) titles get the string "NaN", not np.nan.
FL_D23['d23_rating'] = np.where(title_str.str.contains(r"\(G\)"), "G",
                       np.where(title_str.str.contains(r"\(PG\)"), "PG",
                       np.where(title_str.str.contains(r"\(PG-13\)"), "PG-13",
                       np.where(title_str.str.contains(r"\(R\)"), "R",
                       np.where(title_str.str.contains(r"\(PG_13\)"), "PG-13",
                       np.where(title_str.str.contains(r"\(PG-13"), "PG-13",
                       np.where(title_str.str.contains(r"\(NR\)"), "NaN",
                                "NaN")))))))

# If the rating is included in the title string, then strip it out

title_str = FL_D23['d23_title_str']

# regex=True is passed explicitly because pandas 2.0 and later default to literal replacement.
FL_D23['d23_title_str'] = np.where(title_str.str.contains(r"\(G\)"), title_str.str.replace(r"\(G\)", '', regex=True),
                          np.where(title_str.str.contains(r"\(PG\)"), title_str.str.replace(r"\(PG\)", '', regex=True),
                          np.where(title_str.str.contains(r"\(PG-13\)"), title_str.str.replace(r"\(PG-13\)", '', regex=True),
                          np.where(title_str.str.contains(r"\(R\)"), title_str.str.replace(r"\(R\)", '', regex=True),
                          np.where(title_str.str.contains(r"\(PG_13\)"), title_str.str.replace(r"\(PG_13\)", '', regex=True),
                          np.where(title_str.str.contains(r"\(PG-13"), title_str.str.replace(r"\(PG-13", '', regex=True),
                          np.where(title_str.str.contains(r"\(NR\)"), title_str.str.replace(r"\(NR\)", '', regex=True),
                                   title_str)))))))

FL_D23 = FL_D23[['orig_title', 'd23_index', 'd23_year', 'd23_rating', 'd23_title_str']]

In [24]:
FL_D23.head(35)

Unnamed: 0,orig_title,d23_index,d23_year,d23_rating,d23_title_str
0,1. 1937: Snow White and the Seven Dwarfs (G),1,1937,G,Snow White and the Seven Dwarfs
1,2. 1940: Pinocchio (G),2,1940,G,Pinocchio
2,3. 1940: Fantasia (G),3,1940,G,Fantasia
3,4. 1941: The Reluctant Dragon,4,1941,,The Reluctant Dragon
4,5. 1941: Dumbo (G),5,1941,G,Dumbo
5,6. 1942: Bambi (G),6,1942,G,Bambi
6,7. 1943: Saludos Amigos,7,1943,,Saludos Amigos
7,8. 1943: Victory Through Air Power,8,1943,,Victory Through Air Power
8,9. 1945: The Three Caballeros (G),9,1945,G,The Three Caballeros
9,10. 1946: Make Mine Music,10,1946,,Make Mine Music


In [25]:
FL_D23.tail(35)

Unnamed: 0,orig_title,d23_index,d23_year,d23_rating,d23_title_str
714,715. 2016: Queen of Katwe (PG),715,2016,PG,Queen of Katwe
715,716. 2016: Doctor Strange (Marvel) (PG-13),716,2016,PG-13,Doctor Strange (Marvel)
716,717. 2016: Moana (PG),717,2016,PG,Moana
717,718. 2016: Rogue One: A Star Wars Story (Lucas...,718,2016,PG-13,Rogue One: A Star Wars Story (Lucasilm)
718,719. 2017: Dangal (Disney India),719,2017,,Dangal (Disney India)
719,720. 2017: Beauty and the Beast (PG),720,2017,PG,Beauty and the Beast
720,721. 2017: Born in China (Disneynature) (G),721,2017,G,Born in China (Disneynature)
721,"722. 2017: Guardians of the Galaxy, Vol. 2 (Ma...",722,2017,PG-13,"Guardians of the Galaxy, Vol. 2 (Marvel)"
722,723. 2017: Pirates of the Caribbean: Dead Men ...,723,2017,PG-13,Pirates of the Caribbean: Dead Men Tell No Ta...
723,724. 2017: Cars 3 (Pixar) (G),724,2017,G,Cars 3 (Pixar)


In [26]:
# Count the remaining opening parentheses to understand how much cleanup is left

FL_D23['d23_p_count'] = FL_D23['d23_title_str'].str.count(r"\(")

In [27]:
# 405 rows are ready to go
# 344 rows contain one set of parentheses

FL_D23['d23_p_count'].value_counts()

0    405
1    344
Name: d23_p_count, dtype: int64

In [28]:
FL_D23[FL_D23['d23_p_count'] == 1]

Unnamed: 0,orig_title,d23_index,d23_year,d23_rating,d23_title_str,d23_p_count
155,156. 1984: Splash (Touchstone) (PG),156,1984,PG,Splash (Touchstone),1
157,158. 1984: Country (Touchstone) (PG),158,1984,PG,Country (Touchstone),1
158,159. 1985: Baby...Secret of the Lost Legend(To...,159,1985,PG,Baby...Secret of the Lost Legend(Touchstone),1
161,162. 1985: My Science Project (Touchstone) (PG),162,1985,PG,My Science Project (Touchstone),1
164,165. 1986: Down and Out in Beverly Hills (Touc...,165,1986,R,Down and Out in Beverly Hills (Touchstone),1
...,...,...,...,...,...,...
738,739. 2019: Captain Marvel (Marvel) (PG-13),739,2019,PG-13,Captain Marvel (Marvel),1
740,741. 2019: Penguins (Disneynature) (G),741,2019,G,Penguins (Disneynature),1
741,742. 2019: Avengers: Endgame (Marvel) (PG-13),742,2019,PG-13,Avengers: Endgame (Marvel),1
743,744. 2019: Toy Story 4 (Pixar) (G),744,2019,G,Toy Story 4 (Pixar),1


In [29]:
# Splitting out the title from everything else

studio_split = FL_D23['d23_title_str'].str.split(r"\(", n=1, expand=True)

In [30]:
FL_D23['d23_title'] = studio_split[0]
FL_D23['d23_studio'] = studio_split[1]
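# regex=True is needed for the escaped ")" to be treated as a pattern in pandas 2.0 and later
FL_D23['d23_studio'] = FL_D23['d23_studio'].str.replace(r'\)', '', regex=True)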

# FL_D23[FL_D23['d23_p_count'] == 1]
FL_D23 = FL_D23[['orig_title', 'd23_index', 'd23_year', 'd23_rating', 'd23_title', 'd23_studio']]

In [31]:
FL_D23 = FL_D23[['orig_title', 'd23_index', 'd23_year', 'd23_rating', 'd23_title', 'd23_studio']]

In [33]:
FL_D23.head()

Unnamed: 0,orig_title,d23_index,d23_year,d23_rating,d23_title,d23_studio
0,1. 1937: Snow White and the Seven Dwarfs (G),1,1937,G,Snow White and the Seven Dwarfs,
1,2. 1940: Pinocchio (G),2,1940,G,Pinocchio,
2,3. 1940: Fantasia (G),3,1940,G,Fantasia,
3,4. 1941: The Reluctant Dragon,4,1941,,The Reluctant Dragon,
4,5. 1941: Dumbo (G),5,1941,G,Dumbo,
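
With the D23 titles separated, the titles that appear in one list but not the other can be enumerated. The sketch below is one possible approach under simple assumptions: titles are compared after lowercasing and collapsing whitespace, and the helper names `normalize_title` and `titles_only_in` are hypothetical. The sample lists are illustrative only and are not results from the real data.

```python
def normalize_title(title):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(title.casefold().split())


def titles_only_in(left, right):
    """Return the titles from `left` that have no normalized match in `right`."""
    right_keys = {normalize_title(title) for title in right}
    return sorted(title for title in left if normalize_title(title) not in right_keys)


# Illustrative input only; the real lists come from FL_D23 and FL_Disney.
d23_sample = ["Dumbo ", "The Reluctant Dragon", "Fantasia"]
disney_sample = ["dumbo", "Fantasia"]
print(titles_only_in(d23_sample, disney_sample))
```

On the DataFrames themselves, an outer `merge` with `indicator=True` would be another way to get the same kind of comparison.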


In [32]:
# spot checking some film titles to confirm that the splits all worked as expected.

# for val in FL_D23['d23_title']:
#     print(val)
    
# for val in FL_D23['d23_year']:
#     print(val)    

# There are a number of "Touchstone" and "Hollywood Pictures", so the split didn't work as expected
# for val in FL_D23['d23_rating']:
#     print(val)    

# for val in FL_D23['d23_studio']:
#     print(val)    

### Import Wikipedia data (All Films)

Wikipedia's list includes a lot more information, but only 425 titles, as compared to 657 from Disney.com and 748 from D23.  <mark> Will have to create a list of different titles and research further</mark>

In [4]:
FL_WK = pd.read_csv('DD_Films_LIst_WK/FIlms_List_WK.csv')
print(FL_WK.shape)
FL_WK.head()

(425, 10)


Unnamed: 0,US Release,Other Release Date,Title,Co-production companies,Category,Direct to video or streaming exclusive Disney+,Premium video on demand release through Disney+,Simultaneous release to theatres and on premium video on demand,non-US Film,Notes
0,5/19/1937,,Academy Award Review of Walt Disney Cartoons,,Animated feature,,,,,
1,12/4/1939,,Snow White and the Seven Dwarfs,,Animated feature,,,,,
2,6/1/1940,,Pinocchio,,Animated feature,,,,,
3,10/23/1941,,Dumbo,,Animated feature,,,,,
4,11/2/1941,,Bambi,,Animated feature,,,,,
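
The Wikipedia list stores a full US release date, while the D23 list carries only a year. To compare the two on the same footing, the date can be reduced to a year. A minimal sketch, assuming the `M/D/YYYY` format seen in the rows above; the function name `release_year` is hypothetical.

```python
from datetime import datetime


def release_year(date_text):
    """Return the year of an M/D/YYYY date string, or None if it is missing or malformed."""
    if not isinstance(date_text, str) or not date_text.strip():
        return None  # covers empty cells and NaN values read from the CSV
    try:
        return datetime.strptime(date_text.strip(), "%m/%d/%Y").year
    except ValueError:
        return None


print(release_year("5/19/1937"))
print(release_year(""))
```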


### Import Wikipedia data (Animated Films)

Wikipedia's list of Disney animated films.  A lot of movie info, but only 144 titles.  <mark> Will have to combine the movie data into the master list</mark>

In [5]:
FL_WK_Animated = pd.read_csv('DD_Animated_List_WK/Disney_Animated_List_wk.csv')
print(FL_WK_Animated.shape)
FL_WK_Animated.head()

(144, 14)


Unnamed: 0,Title,Original U.S. theatrical release date[rls 1],Other Theatrical Release Date,Animation studio[st 2],CoCredit,Released By,Film Type,Live-action / Animation hybrid,"Not produced by Disney, but released under its label.",US Release Exceptions,Released under the Touchstone Pictures label,Released by Disney outside North America,Released by Miramax Films when the studio was a subsidiary of Disney at the time of release,Release Note
0,Academy Award Review of Walt Disney Cartoons,5/19/1937,,Walt Disney Animation Studios (1937 - Present),,RKO Radio Pictures,Animation,,,,,,,
1,Snow White and the Seven Dwarfs,12/21/1937,,Walt Disney Animation Studios (1937 - Present),,RKO Radio Pictures,Animation,,,,,,,
2,Pinocchio,2/7/1940,,Walt Disney Animation Studios (1937 - Present),,RKO Radio Pictures,Animation,,,,,,,
3,Fantasia,11/13/1940,1/29/1941,Walt Disney Animation Studios (1937 - Present),,Walt Disney Productions / RKO Radio Pictures,Animation,Live-action / Animation hybrid,,,,,,Originally distributed by Walt Disney Producti...
4,The Reluctant Dragon,6/20/1941,,Walt Disney Animation Studios (1937 - Present),,RKO Radio Pictures,Animation,Live-action / Animation hybrid,,,,,,
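
Combining this extra detail into a master list amounts to a left join on the title. The sketch below shows the idea on plain dictionaries. The function name `merge_by_title` and the lowercase keys `title` and `released_by` are assumptions made for illustration, and matching on title alone would not separate remakes that share a name.

```python
def merge_by_title(master_rows, extra_rows, key="title"):
    """Left-join extra_rows onto master_rows by case-insensitive title."""
    def norm(title):
        return " ".join(title.casefold().split())

    extras = {norm(row[key]): row for row in extra_rows}
    merged = []
    for row in master_rows:
        extra = extras.get(norm(row[key]), {})
        # Values already present in the master row win over the extra data.
        merged.append({**extra, **row})
    return merged


master = [{"title": "Fantasia"}, {"title": "Moana"}]
animated = [{"title": "fantasia", "released_by": "Walt Disney Productions / RKO Radio Pictures"}]
print(merge_by_title(master, animated))
```

With DataFrames, `master.merge(animated, on='title', how='left')` would be the closest pandas counterpart, assuming both frames share a cleaned `title` column.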
