# Exploratory Data Analysis and Data Clean Up

The main goal of this notebook is to get an overview of the four datasets we found. To retain as much useful information as possible, we then merge the four datasets into a single clean dataset, which is the only dataset used in later work.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Explore the first dataset

In [14]:
# data1 comes from https://www.kaggle.com/sooaaib/walt-disney-movies?select=disney_movies.csv
# read data1
# error_bad_lines was removed in pandas 2.0; on_bad_lines (pandas >= 1.3) replaces it
data1 = pd.read_csv('./data1.csv', on_bad_lines='warn')
print('size of the first dataset is ', data1.shape)
data1.head()

size of the first dataset is  (437, 36)


Unnamed: 0.1,Unnamed: 0,title,Production company,Release date,Running time,Country,Language,Box office,Running time (int),Budget (float),...,Narrated by,Cinematography,Edited by,Screenplay by,Production companies,Japanese,Hepburn,Adaptation by,Traditional,Simplified
0,0,Academy Award Review of,Walt Disney Productions,"['May 19, 1937']",41 minutes (74 minutes 1966 release),United States,English,$45.472,41.0,,...,,,,,,,,,,
1,1,Snow White and the Seven Dwarfs,Walt Disney Productions,"['December 21, 1937 ( Carthay Circle Theatre ,...",83 minutes,United States,English,$418 million,83.0,1490000.0,...,,,,,,,,,,
2,2,Pinocchio,Walt Disney Productions,"['February 7, 1940 ( Center Theatre )', 'Febru...",88 minutes,United States,English,$164 million,88.0,2600000.0,...,,,,,,,,,,
3,3,Fantasia,Walt Disney Productions,"['November 13, 1940']",126 minutes,United States,English,$76.4–$83.3 million,126.0,2280000.0,...,Deems Taylor,James Wong Howe,,,,,,,,
4,4,The Reluctant Dragon,Walt Disney Productions,"['June 20, 1941']",74 minutes,United States,English,"$960,000 (worldwide rentals)",74.0,600000.0,...,,Bert Giennon,Paul Weatherwax,,,,,,,


In [10]:
print("Number of unique films: ", len(data1.title.unique()))
print("Number of unique release date: ", len(data1['Release date'].unique()))

Number of unique films:  424
Number of unique release date:  433


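data1 has 437 rows but only 424 unique titles, so some titles appear more than once. The number of unique release dates (433) is higher than the number of unique titles, which suggests that the repeated titles are mostly remakes released on different dates rather than exact duplicate rows.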
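
One way to check this reading is to list every row whose title occurs more than once, together with its release date. The sketch below uses a small hand-made frame (the name `films` and its four rows are assumptions for illustration, copied from rows that appear in this notebook) rather than the real `data1`.

```python
import pandas as pd

# Toy stand-in for data1 (illustrative rows only)
films = pd.DataFrame({
    'title': ['Freaky Friday', 'Freaky Friday', 'Pinocchio', 'Fantasia'],
    'Release date': ['1976-12-17', '2003-08-04', '1940-02-07', '1940-11-13'],
})

# keep=False marks every row whose title occurs more than once
repeated = films[films.duplicated(subset='title', keep=False)]

# Number of distinct release dates for each repeated title;
# a value above 1 points to a remake rather than a duplicated row
dates_per_title = repeated.groupby('title')['Release date'].nunique()

print(repeated.sort_values(['title', 'Release date']))
print(dates_per_title)
```

Applied to the real data, the same two lines would replace `films` with `data1`.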

In [11]:
data1['title'].value_counts()

Freaky Friday            2
Cinderella               2
The Parent Trap          2
Mulan                    2
Pete's Dragon            2
                        ..
The Haunted Mansion      1
Artemis Fowl             1
Cars 2                   1
Pinocchio                1
Planes: Fire & Rescue    1
Name: title, Length: 424, dtype: int64

In [12]:
data1.loc[data1['title'] == 'Freaky Friday']

Unnamed: 0.1,Unnamed: 0,title,Production company,Release date,Running time,Country,Language,Box office,Running time (int),Budget (float),...,Narrated by,Cinematography,Edited by,Screenplay by,Production companies,Japanese,Hepburn,Adaptation by,Traditional,Simplified
127,127,Freaky Friday,Walt Disney Productions,"['December 17, 1976']",95 minutes,United States,English,$36 million,95.0,5000000.0,...,,Charles F. Wheeler,Cotton Warburton,Mary Rodgers,,,,,,
265,265,Freaky Friday,"['Walt Disney Pictures', 'Gunn Films']","['August 4, 2003 ( Los Angeles )', 'August 6, ...",97 minutes,United States,English,$160.8 million,97.0,26000000.0,...,,Oliver Wood,Bruce Green,"['Heather Hach', 'Leslie Dixon']",,,,,,


For now, we treat different versions of the same film as distinct films. data1 has 36 features in total, but only some of them are relevant to our project, so we select just those.

In [18]:
list(data1.columns)

['Unnamed: 0',
 'title',
 'Production company',
 'Release date',
 'Running time',
 'Country',
 'Language',
 'Box office',
 'Running time (int)',
 'Budget (float)',
 'Box office (float)',
 'Release date (datetime)',
 'imdb_rating',
 'imdb_votes',
 'imdb_id',
 'metascore',
 'rotten_tomatoes',
 'Directed by',
 'Produced by',
 'Written by',
 'Based on',
 'Starring',
 'Music by',
 'Distributed by',
 'Budget',
 'Story by',
 'Narrated by',
 'Cinematography',
 'Edited by',
 'Screenplay by',
 'Production companies',
 'Japanese',
 'Hepburn',
 'Adaptation by',
 'Traditional',
 'Simplified']

In [35]:
# select the features that are useful to our project
data1_select = data1[['title',
 'Running time (int)',
 'Budget (float)',
 'Box office (float)',
 'Release date (datetime)',
 'imdb_rating',
 'metascore',
 'rotten_tomatoes',
 'Directed by',
 'Produced by',
 'Music by',
 'Distributed by']].rename(columns={'Running time (int)': 'Running time', 
                                    'Budget (float)': 'Budget',
                                    'Box office (float)': 'Box office',
                                    'Release date (datetime)': 'Release date',
                                    'imdb_rating': 'imdb'})
data1_select

Unnamed: 0,title,Running time,Budget,Box office,Release date,imdb,metascore,rotten_tomatoes,Directed by,Produced by,Music by,Distributed by
0,Academy Award Review of,41.0,,4.547200e+01,1937-05-19,7.2,7.2,,,,,
1,Snow White and the Seven Dwarfs,83.0,1490000.0,4.180000e+08,1937-12-21,7.6,7.6,,"['David Hand (supervising)', 'William Cottrell...",Walt Disney,"['Frank Churchill', 'Paul Smith', 'Leigh Harli...",RKO Radio Pictures
2,Pinocchio,88.0,2600000.0,1.640000e+08,1940-02-07,7.4,7.4,100%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...",Walt Disney,"['Leigh Harline', 'Paul J. Smith']",RKO Radio Pictures
3,Fantasia,126.0,2280000.0,8.330000e+07,1940-11-13,7.8,7.8,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Walt Disney', 'Ben Sharpsteen']",See program,"['Walt Disney Productions', 'RKO Radio Pictures']"
4,The Reluctant Dragon,74.0,600000.0,9.600000e+05,1941-06-20,6.9,6.9,67%,"['Alfred Werker', '(live action)', 'Hamilton L...",Walt Disney,"['Frank Churchill', 'Larry Morey']",RKO Radio Pictures
...,...,...,...,...,...,...,...,...,...,...,...,...
432,Soul,100.0,150000000.0,,2020-10-11,7.0,7.0,45%,Pete Docter,Dana Murray,"['Trent Reznor', 'Atticus Ross']","['Walt Disney Studios', 'Motion Pictures']"
433,Raya and the Last Dragon,,,,2021-03-12,,,,"['Don Hall', 'Carlos López Estrada']","['Osnat Shurer', 'Peter Del Vecho']",,"['Walt Disney Studios', 'Motion Pictures']"
434,Cruella,,,,2021-05-28,,,,Craig Gillespie,"['Kristin Burr', 'Andrew Gunn', 'Marc Platt']",,Walt Disney Studios Motion Pictures
435,Jungle Cruise,,,,2021-07-30,,,,Jaume Collet-Serra,"['John Davis', 'John Fox', 'Beau Flynn', 'Dway...","['James Newton Howard', 'Metallica']",Walt Disney Studios Motion Pictures


## Explore the second dataset

In [15]:
# read data2 from https://www.kaggle.com/dikshabhati2002/walt-disney-movies
# error_bad_lines was removed in pandas 2.0; on_bad_lines (pandas >= 1.3) replaces it
data2 = pd.read_csv('./data2.csv', on_bad_lines='warn')
print('size of the second dataset is ', data2.shape)
data2.head()

size of the second dataset is  (444, 21)


Unnamed: 0.1,Unnamed: 0,title,Production company,Country,Language,Running time,Budget,Box office,Release date,imdb,...,rotten_tomatoes,Directed by,Produced by,Based on,Starring,Music by,Distributed by,Cinematography,Edited by,Screenplay by
0,0,Academy Award Review of,Walt Disney Productions,United States,English,41.0,,45.472,1937-05-19,7.2,...,,,,,,,,,,
1,1,Snow White and the Seven Dwarfs,Walt Disney Productions,United States,English,83.0,1490000.0,418000000.0,1937-12-21,7.6,...,,"['David Hand (supervising)', 'William Cottrell...",Walt Disney,"['Snow White', 'by The', 'Brothers Grimm']","['Adriana Caselotti', 'Lucille La Verne', 'Har...","['Frank Churchill', 'Paul Smith', 'Leigh Harli...",RKO Radio Pictures,,,
2,2,Pinocchio,Walt Disney Productions,United States,English,88.0,2600000.0,164000000.0,1940-02-07,7.4,...,73%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...",Walt Disney,"['The Adventures of Pinocchio', 'by', 'Carlo C...","['Cliff Edwards', 'Dickie Jones', 'Christian R...","['Leigh Harline', 'Paul J. Smith']",RKO Radio Pictures,,,
3,3,Fantasia,Walt Disney Productions,United States,English,126.0,2280000.0,83300000.0,1940-11-13,7.7,...,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Walt Disney', 'Ben Sharpsteen']",,"['Leopold Stokowski', 'Deems Taylor']",See program,RKO Radio Pictures,James Wong Howe,,
4,4,The Reluctant Dragon,Walt Disney Productions,United States,English,74.0,600000.0,960000.0,1941-06-20,6.9,...,68%,"['Alfred Werker', '(live action)', 'Hamilton L...",Walt Disney,,"['Robert Benchley', 'Frances Gifford', 'Buddy ...","['Frank Churchill', 'Larry Morey']",RKO Radio Pictures,Bert Giennon,Paul Weatherwax,


In [22]:
print("Number of unique films: ", len(data2.title.unique()))
print("Number of unique release date: ", len(data2['Release date'].unique()))
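# NOTE: this call inspects data1, so the counts printed below repeat the data1 result;
# use data2['title'].value_counts() to inspect data2 instead
data1['title'].value_counts()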

Number of unique films:  431
Number of unique release date:  430


Freaky Friday            2
Cinderella               2
The Parent Trap          2
Mulan                    2
Pete's Dragon            2
                        ..
The Haunted Mansion      1
Artemis Fowl             1
Cars 2                   1
Pinocchio                1
Planes: Fire & Rescue    1
Name: title, Length: 424, dtype: int64

In [24]:
data2.loc[data2['title'] == 'Freaky Friday']

Unnamed: 0.1,Unnamed: 0,title,Production company,Country,Language,Running time,Budget,Box office,Release date,imdb,...,rotten_tomatoes,Directed by,Produced by,Based on,Starring,Music by,Distributed by,Cinematography,Edited by,Screenplay by
127,127,Freaky Friday,Walt Disney Productions,United States,English,95.0,5000000.0,36000000.0,1976-12-20,6.2,...,88%,Gary Nelson,Ron Miller,"['Freaky Friday', 'by Mary Rodgers']","['Jodie Foster', 'Barbara Harris', 'John Astin']",Johnny Mandel,Buena Vista Distribution,Charles F. Wheeler,Cotton Warburton,Mary Rodgers
265,265,Freaky Friday,"['Walt Disney Pictures', 'Gunn Films']",United States,English,97.0,26000000.0,160800000.0,2003-08-04,6.2,...,88%,Mark Waters,Andrew Gunn,"['Freaky Friday', 'by', 'Mary Rodgers']","['Jamie Lee Curtis', 'Lindsay Lohan', 'Harold ...",Rolfe Kent,Buena Vista Pictures,Oliver Wood,Bruce Green,"['Heather Hach', 'Leslie Dixon']"


In [26]:
list(data2.columns)

['Unnamed: 0',
 'title',
 'Production company',
 'Country',
 'Language',
 'Running time',
 'Budget',
 'Box office',
 'Release date',
 'imdb',
 'metascore',
 'rotten_tomatoes',
 'Directed by',
 'Produced by',
 'Based on',
 'Starring',
 'Music by',
 'Distributed by',
 'Cinematography',
 'Edited by',
 'Screenplay by']

In [27]:
# select features
data2_select = data2[['title',
 'Running time',
 'Budget',
 'Box office',
 'Release date',
 'imdb',
 'metascore',
 'rotten_tomatoes',
 'Directed by',
 'Produced by',
 'Music by',
 'Distributed by']]
data2_select

Unnamed: 0,title,Running time,Budget,Box office,Release date,imdb,metascore,rotten_tomatoes,Directed by,Produced by,Music by,Distributed by
0,Academy Award Review of,41.0,,4.547200e+01,1937-05-19,7.2,,,,,,
1,Snow White and the Seven Dwarfs,83.0,1490000.0,4.180000e+08,1937-12-21,7.6,95.0,,"['David Hand (supervising)', 'William Cottrell...",Walt Disney,"['Frank Churchill', 'Paul Smith', 'Leigh Harli...",RKO Radio Pictures
2,Pinocchio,88.0,2600000.0,1.640000e+08,1940-02-07,7.4,99.0,73%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...",Walt Disney,"['Leigh Harline', 'Paul J. Smith']",RKO Radio Pictures
3,Fantasia,126.0,2280000.0,8.330000e+07,1940-11-13,7.7,96.0,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Walt Disney', 'Ben Sharpsteen']",See program,RKO Radio Pictures
4,The Reluctant Dragon,74.0,600000.0,9.600000e+05,1941-06-20,6.9,,68%,"['Alfred Werker', '(live action)', 'Hamilton L...",Walt Disney,"['Frank Churchill', 'Larry Morey']",RKO Radio Pictures
...,...,...,...,...,...,...,...,...,...,...,...,...
439,The Little Mermaid,,,,,7.6,88.0,93%,Rob Marshall,"['Rob Marshall', 'John DeLuca', 'Marc Platt', ...","['Alan Menken (score and songs)', 'Howard Ashm...","['Walt Disney Studios', 'Motion Pictures']"
440,Peter Pan & Wendy,,,,,,,,David Lowery,"['Jim Whitaker', 'Joe Roth']",,Walt Disney Studios Motion Pictures
441,Home Alone,,,,,7.6,63.0,66%,Dan Mazer,"['Hutch Parker', 'Dan Wilson']",John Debney,Disney+
442,Shrunk,,,,,,,,Joe Johnston,"['David Hoberman', 'Todd Lieberman']",,"['Walt Disney Studios', 'Motion Pictures']"


## Explore the third dataset

In [16]:
# read data3 from https://www.kaggle.com/prateekmaj21/disney-movies
# error_bad_lines was removed in pandas 2.0; on_bad_lines (pandas >= 1.3) replaces it
data3 = pd.read_csv('./data3.csv', on_bad_lines='warn')
print('size of the third dataset is ', data3.shape)
data3.head()

size of the third dataset is  (579, 6)


Unnamed: 0,movie_title,release_date,genre,mpaa_rating,total_gross,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,G,184925485,5228953251
1,Pinocchio,1940-02-09,Adventure,G,84300000,2188229052
2,Fantasia,1940-11-13,Musical,G,83320000,2187090808
3,Song of the South,1946-11-12,Adventure,G,65000000,1078510579
4,Cinderella,1950-02-15,Drama,G,85000000,920608730


In this dataset, we are interested in the genre and the inflation adjusted gross.

In [52]:
data3_select = data3[['movie_title','release_date','genre','inflation_adjusted_gross']].rename(columns={'movie_title': 'title','release_date': 'Release date'})
data3_select

Unnamed: 0,title,Release date,genre,inflation_adjusted_gross
0,Snow White and the Seven Dwarfs,1937-12-21,Musical,5228953251
1,Pinocchio,1940-02-09,Adventure,2188229052
2,Fantasia,1940-11-13,Musical,2187090808
3,Song of the South,1946-11-12,Adventure,1078510579
4,Cinderella,1950-02-15,Drama,920608730
...,...,...,...,...
574,The Light Between Oceans,2016-09-02,Drama,12545979
575,Queen of Katwe,2016-09-23,Drama,8874389
576,Doctor Strange,2016-11-04,Adventure,232532923
577,Moana,2016-11-23,Adventure,246082029


## Explore the fourth dataset

In [17]:
# read data4 from https://www.kaggle.com/therealsampat/disney-movies-dataset
# error_bad_lines was removed in pandas 2.0; on_bad_lines (pandas >= 1.3) replaces it
data4 = pd.read_csv('./data4.csv', on_bad_lines='warn')
print('size of the fourth dataset is ', data4.shape)
data4.head()

size of the fourth dataset is  (432, 32)


Unnamed: 0.1,Unnamed: 0,title,Production company,Release date,Running time,Country,Language,Running time (int),Budget (float),Box office (float),...,Box office,Story by,Narrated by,Cinematography,Edited by,Screenplay by,Production companies,Adaptation by,Traditional,Simplified
0,0,Academy Award Review of,Walt Disney Productions,"['May 19, 1937']",41 minutes (74 minutes 1966 release),United States,English,41.0,,,...,,,,,,,,,,
1,1,Snow White and the Seven Dwarfs,Walt Disney Productions,"['December 21, 1937 ( Carthay Circle Theatre ,...",83 minutes,United States,English,83.0,1490000.0,418000000.0,...,$418 million,,,,,,,,,
2,2,Pinocchio,Walt Disney Productions,"['February 7, 1940 ( Center Theatre )', 'Febru...",88 minutes,United States,English,88.0,2600000.0,164000000.0,...,$164 million,"['Ted Sears', 'Otto Englander', 'Webb Smith', ...",,,,,,,,
3,3,Fantasia,Walt Disney Productions,"['November 13, 1940']",126 minutes,United States,English,126.0,2280000.0,83300000.0,...,$76.4–$83.3 million,"['Joe Grant', 'Dick Huemer']",Deems Taylor,James Wong Howe,,,,,,
4,4,The Reluctant Dragon,Walt Disney Productions,"['June 20, 1941']",74 minutes,United States,English,74.0,600000.0,960000.0,...,"$960,000 (worldwide rentals)",,,Bert Giennon,Paul Weatherwax,,,,,


In [31]:
list(data4.columns)

['Unnamed: 0',
 'title',
 'Production company',
 'Release date',
 'Running time',
 'Country',
 'Language',
 'Running time (int)',
 'Budget (float)',
 'Box office (float)',
 'Release date (datetime)',
 'imdb',
 'metascore',
 'rotten_tomatoes',
 'Directed by',
 'Produced by',
 'Written by',
 'Based on',
 'Starring',
 'Music by',
 'Distributed by',
 'Budget',
 'Box office',
 'Story by',
 'Narrated by',
 'Cinematography',
 'Edited by',
 'Screenplay by',
 'Production companies',
 'Adaptation by',
 'Traditional',
 'Simplified']

In [36]:
# select features
data4_select = data4[['title',
 'Running time (int)',
 'Budget (float)',
 'Box office (float)',
 'Release date (datetime)',
 'imdb',
 'metascore',
 'rotten_tomatoes',
 'Directed by',
 'Produced by',
 'Music by',
 'Distributed by',]].rename(columns={'Running time (int)': 'Running time', 
                                    'Budget (float)': 'Budget',
                                    'Box office (float)': 'Box office',
                                    'Release date (datetime)': 'Release date'})
data4_select

Unnamed: 0,title,Running time,Budget,Box office,Release date,imdb,metascore,rotten_tomatoes,Directed by,Produced by,Music by,Distributed by
0,Academy Award Review of,41.0,,,1937-05-19,7.2,,,,,,
1,Snow White and the Seven Dwarfs,83.0,1490000.0,418000000.0,1937-12-21,7.6,95.0,,"['David Hand (supervising)', 'William Cottrell...",Walt Disney,"['Frank Churchill', 'Paul Smith', 'Leigh Harli...",RKO Radio Pictures
2,Pinocchio,88.0,2600000.0,164000000.0,1940-02-07,7.4,99.0,100%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...",Walt Disney,"['Leigh Harline', 'Paul J. Smith']",RKO Radio Pictures
3,Fantasia,126.0,2280000.0,83300000.0,1940-11-13,7.8,96.0,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Walt Disney', 'Ben Sharpsteen']",See program,"['Walt Disney Productions', 'RKO Radio Pictures']"
4,The Reluctant Dragon,74.0,600000.0,960000.0,1941-06-20,6.9,,67%,"['Alfred Werker', '(live action)', 'Hamilton L...",Walt Disney,"['Frank Churchill', 'Larry Morey']",RKO Radio Pictures
...,...,...,...,...,...,...,...,...,...,...,...,...
427,Soul,,150000000.0,,2020-10-15,7.0,53.0,45%,Pete Docter,Dana Murray,"['Trent Reznor', 'Atticus Ross']","['Walt Disney Studios', 'Motion Pictures']"
428,Raya and the Last Dragon,,,,2021-03-12,,,,"['Don Hall', 'Carlos López Estrada']","['Osnat Shurer', 'Peter Del Vecho']",,"['Walt Disney Studios', 'Motion Pictures']"
429,Cruella,,,,2021-05-28,,,,Craig Gillespie,"['Kristin Burr', 'Andrew Gunn', 'Marc Platt']",,Walt Disney Studios Motion Pictures
430,Jungle Cruise,,,,2021-07-30,,,,Jaume Collet-Serra,"['John Davis', 'John Fox', 'Beau Flynn', 'Dway...","['James Newton Howard', 'Metallica']",Walt Disney Studios Motion Pictures


## Merge the selected datasets

In [48]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
data_temp = pd.concat([data1_select, data2_select], ignore_index=True)

# drop duplicated movies by title and release date
data_temp.drop_duplicates(subset=['title', 'Release date'], keep='first', inplace=True)
data_temp

Unnamed: 0,title,Running time,Budget,Box office,Release date,imdb,metascore,rotten_tomatoes,Directed by,Produced by,Music by,Distributed by
0,Academy Award Review of,41.0,,4.547200e+01,1937-05-19,7.2,7.2,,,,,
1,Snow White and the Seven Dwarfs,83.0,1490000.0,4.180000e+08,1937-12-21,7.6,7.6,,"['David Hand (supervising)', 'William Cottrell...",Walt Disney,"['Frank Churchill', 'Paul Smith', 'Leigh Harli...",RKO Radio Pictures
2,Pinocchio,88.0,2600000.0,1.640000e+08,1940-02-07,7.4,7.4,100%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...",Walt Disney,"['Leigh Harline', 'Paul J. Smith']",RKO Radio Pictures
3,Fantasia,126.0,2280000.0,8.330000e+07,1940-11-13,7.8,7.8,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Walt Disney', 'Ben Sharpsteen']",See program,"['Walt Disney Productions', 'RKO Radio Pictures']"
4,The Reluctant Dragon,74.0,600000.0,9.600000e+05,1941-06-20,6.9,6.9,67%,"['Alfred Werker', '(live action)', 'Hamilton L...",Walt Disney,"['Frank Churchill', 'Larry Morey']",RKO Radio Pictures
...,...,...,...,...,...,...,...,...,...,...,...,...
876,The Little Mermaid,,,,,7.6,88.0,93%,Rob Marshall,"['Rob Marshall', 'John DeLuca', 'Marc Platt', ...","['Alan Menken (score and songs)', 'Howard Ashm...","['Walt Disney Studios', 'Motion Pictures']"
877,Peter Pan & Wendy,,,,,,,,David Lowery,"['Jim Whitaker', 'Joe Roth']",,Walt Disney Studios Motion Pictures
878,Home Alone,,,,,7.6,63.0,66%,Dan Mazer,"['Hutch Parker', 'Dan Wilson']",John Debney,Disney+
879,Shrunk,,,,,,,,Joe Johnston,"['David Hoberman', 'Todd Lieberman']",,"['Walt Disney Studios', 'Motion Pictures']"
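
Note that `keep='first'` retains the whole row from whichever dataset was stacked first, so a value that is missing there stays missing even if a later dataset has it. The sketch below contrasts that with one possible alternative, taking the first non-null value per column within each `(title, Release date)` group. It is not what this notebook does; the frames `a` and `b`, the film names and all numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two toy sources describing the same films (hypothetical values)
a = pd.DataFrame({'title': ['Film A', 'Film B'],
                  'Release date': ['2000-01-01', '2001-01-01'],
                  'Running time': [90.0, 100.0],
                  'Box office': [1.0, np.nan]})
b = pd.DataFrame({'title': ['Film A', 'Film B'],
                  'Release date': ['2000-01-01', '2001-01-01'],
                  'Running time': [90.0, np.nan],
                  'Box office': [1.0, 5.0]})

stacked = pd.concat([a, b], ignore_index=True)

# Approach used in the notebook: keep the first row per key, discard the rest
first_row = stacked.drop_duplicates(subset=['title', 'Release date'], keep='first')

# Alternative: per key, take the first non-null value of every column.
# dropna=False keeps groups whose key contains NaN (e.g. a missing release date).
filled = (stacked
          .groupby(['title', 'Release date'], sort=False, dropna=False)
          .first()
          .reset_index())

print(first_row)
print(filled)
```

In `first_row` the box office of Film B stays missing, while `filled` picks it up from the second source. This only fills gaps; it does not resolve cases where two sources report different non-null values.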


In [49]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
data_temp_2 = pd.concat([data_temp, data4_select], ignore_index=True)

# drop duplicated movies by title and release date
data_temp_2.drop_duplicates(subset=['title', 'Release date'], keep='first', inplace=True)
data_temp_2

Unnamed: 0,title,Running time,Budget,Box office,Release date,imdb,metascore,rotten_tomatoes,Directed by,Produced by,Music by,Distributed by
0,Academy Award Review of,41.0,,4.547200e+01,1937-05-19,7.2,7.2,,,,,
1,Snow White and the Seven Dwarfs,83.0,1490000.0,4.180000e+08,1937-12-21,7.6,7.6,,"['David Hand (supervising)', 'William Cottrell...",Walt Disney,"['Frank Churchill', 'Paul Smith', 'Leigh Harli...",RKO Radio Pictures
2,Pinocchio,88.0,2600000.0,1.640000e+08,1940-02-07,7.4,7.4,100%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...",Walt Disney,"['Leigh Harline', 'Paul J. Smith']",RKO Radio Pictures
3,Fantasia,126.0,2280000.0,8.330000e+07,1940-11-13,7.8,7.8,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Walt Disney', 'Ben Sharpsteen']",See program,"['Walt Disney Productions', 'RKO Radio Pictures']"
4,The Reluctant Dragon,74.0,600000.0,9.600000e+05,1941-06-20,6.9,6.9,67%,"['Alfred Werker', '(live action)', 'Hamilton L...",Walt Disney,"['Frank Churchill', 'Larry Morey']",RKO Radio Pictures
...,...,...,...,...,...,...,...,...,...,...,...,...
471,Shrunk,,,,,,,,Joe Johnston,"['David Hoberman', 'Todd Lieberman']",,"['Walt Disney Studios', 'Motion Pictures']"
472,Night at the Museum,306.0,387000000.0,1.310000e+09,,6.4,48.0,43%,Shawn Levy,"['Shawn Levy', 'Chris Columbus', 'Michael Barn...",Alan Silvestri,"['20th Century Studios', 'Disney+']"
761,Glory Road's,106.0,30000000.0,4.290000e+07,2006-01-13,,,,James Gartner,['Jerry Bruckheimer'],Trevor Rabin,Buena Vista Pictures
784,Beverly Hills Chihuahua,91.0,20000000.0,1.493000e+08,2008-09-26,3.8,41.0,40%,Raja Gosnell,"['David Hoberman', 'Todd Lieberman', 'John Jac...",Heitor Pereira,"['Walt Disney Studios', 'Motion Pictures']"


In [53]:
# merge the combined dataset with data3 on title and release date
data = pd.merge(data_temp_2, data3_select, how='left', on=['title', 'Release date'])
data

Unnamed: 0,title,Running time,Budget,Box office,Release date,imdb,metascore,rotten_tomatoes,Directed by,Produced by,Music by,Distributed by,genre,inflation_adjusted_gross
0,Academy Award Review of,41.0,,4.547200e+01,1937-05-19,7.2,7.2,,,,,,,
1,Snow White and the Seven Dwarfs,83.0,1490000.0,4.180000e+08,1937-12-21,7.6,7.6,,"['David Hand (supervising)', 'William Cottrell...",Walt Disney,"['Frank Churchill', 'Paul Smith', 'Leigh Harli...",RKO Radio Pictures,Musical,5.228953e+09
2,Pinocchio,88.0,2600000.0,1.640000e+08,1940-02-07,7.4,7.4,100%,"['Ben Sharpsteen', 'Hamilton Luske', 'Bill Rob...",Walt Disney,"['Leigh Harline', 'Paul J. Smith']",RKO Radio Pictures,,
3,Fantasia,126.0,2280000.0,8.330000e+07,1940-11-13,7.8,7.8,95%,"['Samuel Armstrong', 'James Algar', 'Bill Robe...","['Walt Disney', 'Ben Sharpsteen']",See program,"['Walt Disney Productions', 'RKO Radio Pictures']",Musical,2.187091e+09
4,The Reluctant Dragon,74.0,600000.0,9.600000e+05,1941-06-20,6.9,6.9,67%,"['Alfred Werker', '(live action)', 'Hamilton L...",Walt Disney,"['Frank Churchill', 'Larry Morey']",RKO Radio Pictures,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
471,Shrunk,,,,,,,,Joe Johnston,"['David Hoberman', 'Todd Lieberman']",,"['Walt Disney Studios', 'Motion Pictures']",,
472,Night at the Museum,306.0,387000000.0,1.310000e+09,,6.4,48.0,43%,Shawn Levy,"['Shawn Levy', 'Chris Columbus', 'Michael Barn...",Alan Silvestri,"['20th Century Studios', 'Disney+']",,
473,Glory Road's,106.0,30000000.0,4.290000e+07,2006-01-13,,,,James Gartner,['Jerry Bruckheimer'],Trevor Rabin,Buena Vista Pictures,,
474,Beverly Hills Chihuahua,91.0,20000000.0,1.493000e+08,2008-09-26,3.8,41.0,40%,Raja Gosnell,"['David Hoberman', 'Todd Lieberman', 'John Jac...",Heitor Pereira,"['Walt Disney Studios', 'Motion Pictures']",,
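
In the output above, Pinocchio has no genre even though it is present in data3. The two sources list different release dates for it (1940-02-07 in the merged data, 1940-02-09 in data3), so an exact match on `['title', 'Release date']` fails. The sketch below shows how `indicator=True` can be used to count such misses, and one possible looser key (title plus release year). The frames `left` and `right` and the `year` column are assumptions made for illustration, built from rows shown in this notebook.

```python
import pandas as pd

# Toy stand-ins for data_temp_2 and data3_select
left = pd.DataFrame({
    'title': ['Pinocchio', 'Fantasia'],
    'Release date': ['1940-02-07', '1940-11-13'],
})
right = pd.DataFrame({
    'title': ['Pinocchio', 'Fantasia'],
    'Release date': ['1940-02-09', '1940-11-13'],
    'genre': ['Adventure', 'Musical'],
})

# indicator=True adds a _merge column: 'both' for matched rows, 'left_only' otherwise
exact = pd.merge(left, right, how='left', on=['title', 'Release date'], indicator=True)
print(exact['_merge'].value_counts())

# Looser key: same title and same release year
left['year'] = pd.to_datetime(left['Release date']).dt.year
right['year'] = pd.to_datetime(right['Release date']).dt.year
by_year = pd.merge(left, right.drop(columns='Release date'),
                   how='left', on=['title', 'year'])
print(by_year)
```

Matching on year would wrongly pair two different films that share a title and a release year, so the result should be checked before it is relied on.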


In [54]:
# count null values in each column
data.isnull().sum(axis = 0)

title                         0
Running time                 18
Budget                      186
Box office                  104
Release date                 29
imdb                         26
metascore                    41
rotten_tomatoes              59
Directed by                   1
Produced by                  12
Music by                     12
Distributed by                3
genre                       342
inflation_adjusted_gross    339
dtype: int64
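
Raw counts are easier to compare when shown next to the share of rows they represent. A minimal sketch, using a made-up frame `df` in place of `data` (column names follow the notebook, values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged dataset
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'Budget': [1.0, np.nan, np.nan, 4.0],
    'genre': [np.nan, np.nan, np.nan, 'Drama'],
})

# isnull().mean() gives the fraction of missing values per column
summary = pd.DataFrame({
    'missing': df.isnull().sum(),
    'share': df.isnull().mean(),
}).sort_values('share', ascending=False)

print(summary)
```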