# 07 EDA and Clean Principals Data


## 07.01 Imports


### 07.01.01 Python Imports


In [850]:
import pandas as pd
import numpy as np
import re
import datetime

### 07.01.02 Import Principal History Data


In [790]:
principals_history = pd.read_csv('../Bens_Data/directors_writers_combined_history_updated.csv')
principals_history.drop(columns=['Unnamed: 0','_merge'], inplace=True)

In [791]:
principals_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7765 entries, 0 to 7764
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          7765 non-null   object 
 1   titleType       7765 non-null   object 
 2   primaryTitle    7765 non-null   object 
 3   startYear       7765 non-null   object 
 4   runtimeMinutes  7765 non-null   object 
 5   genres          7765 non-null   object 
 6   newurl          7765 non-null   object 
 7   mpaarating      7765 non-null   object 
 8   rlsdt           7765 non-null   object 
 9   budget          7765 non-null   object 
 10  wordlwide       7765 non-null   object 
 11  averageRating   6003 non-null   float64
 12  numVotes        6003 non-null   float64
dtypes: float64(2), object(11)
memory usage: 788.8+ KB


To include a principal's track record in our model, we're going to focus on start year (so that we can get ratings for previous projects), genres (so that we can get ratings for similar projects), averageRating, and numVotes. We will attempt to use the release date (rlsdt) when we can, but it might prove too difficult. The budget and wordlwide columns also have too much missing data to use for now.

## 07.02 Average Ratings


We'll remove any rows that are missing averageRating and/or numVotes. We're not going to attempt to guess or impute ratings or vote counts.

In [792]:
principals_history.head(10)

Unnamed: 0,tconst,titleType,primaryTitle,startYear,runtimeMinutes,genres,newurl,mpaarating,rlsdt,budget,wordlwide,averageRating,numVotes
0,tt0113403,movie,A Midwinter's Tale,1995,99,Comedy,https://www.imdb.com/title/tt0113403,R,"February 16, 1996 (United States)",error,"$469,571",7.2,2577.0
1,tt0450972,movie,As You Like It,2006,127,"Comedy,Drama,Romance",https://www.imdb.com/title/tt0450972,PG,"September 21, 2007 (United Kingdom)",error,"$563,162",6.1,3354.0
2,tt0475331,movie,The Magic Flute,2006,135,"Drama,Musical,Romance",https://www.imdb.com/title/tt0475331,error,"December 13, 2006 (France)","$27,000,000 (estimated)","$2,000,853",6.5,1236.0
3,tt0800369,movie,Thor,2011,115,"Action,Adventure,Fantasy",https://www.imdb.com/title/tt0800369,PG-13,"May 6, 2011 (United States)","$150,000,000 (estimated)","$449,326,618",7.0,810857.0
4,tt11229040,movie,Untitled Bee Gees Biopic,2022,\N,"Biography,Drama,Music",https://www.imdb.com/title/tt11229040,error,error,error,error,,
5,tt12789558,movie,Belfast,2021,98,"Biography,Drama,History",https://www.imdb.com/title/tt12789558,PG-13,"November 12, 2021 (United States)",error,"$46,922,870",7.3,44239.0
6,tt1661199,movie,Cinderella,2015,105,"Adventure,Drama,Family",https://www.imdb.com/title/tt1661199,PG,"March 13, 2015 (United States)","$95,000,000 (estimated)","$542,358,331",6.9,173300.0
7,tt3089630,movie,Artemis Fowl,2020,95,"Adventure,Family,Fantasy",https://www.imdb.com/title/tt3089630,PG,"June 12, 2020 (United States)",error,error,4.2,27155.0
8,tt5226844,movie,Branagh Theatre Live: The Winter's Tale,2015,180,Drama,https://www.imdb.com/title/tt5226844,error,"December 4, 2019 (United States)",error,"$141,143",7.8,269.0
9,tt5943392,movie,Branagh Theatre Live: Romeo and Juliet,2016,165,Romance,https://www.imdb.com/title/tt5943392,error,"November 18, 2016 (Japan)",error,"$1,191,021",8.2,377.0


In [793]:
principals_history['averageRating'].isnull().sum(), principals_history['numVotes'].isnull().sum()

(1762, 1762)

In [794]:
principals_history.drop(principals_history[principals_history['averageRating'].isnull()].index, inplace = True)

In [795]:
principals_history['averageRating'].isnull().sum(), principals_history['numVotes'].isnull().sum()

(0, 0)
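As a side note, the same filter can be expressed with `dropna`, which states the rule (drop a row if either rating field is missing) in one call. This is a minimal sketch on a toy frame; `demo` and its values are illustrative stand-ins, not rows from `principals_history`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for principals_history (illustrative values only)
demo = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [7.2, np.nan, 6.1],
    "numVotes": [2577.0, np.nan, 3354.0],
})

# Drop a row when either rating field is missing
demo_rated = demo.dropna(subset=["averageRating", "numVotes"])
print(demo_rated["tconst"].tolist())
```

Unlike the index-based drop, this also catches a row where only numVotes is missing, although the counts above suggest the two columns are always missing together.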

## 07.03 Start Year

In [796]:
principals_history['startYear'].isnull().sum(), principals_history['startYear'].str.startswith('\\N').sum()

(0, 0)

## 07.04 Genres

We're not going to attempt to guess or impute the genres.  However, we would like to use them if at all possible. Any genre data that is missing will be filled with "unknown". We will do some vectorization on these columns later so that we can try to get track records for similar categories when building our final model.

In [797]:
principals_history['genres'].isnull().sum(), principals_history['genres'].str.startswith('\\N').sum()

(0, 47)

In [806]:
principals_history['genres'].value_counts()

Drama                         434
Comedy                        353
Adventure,Animation,Comedy    199
Western                       186
Comedy,Romance                147
                             ... 
Action,History,Western          1
War,Western                     1
Adventure,Musical,War           1
Adventure,Mystery,War           1
Biography,Drama,Western         1
Name: genres, Length: 485, dtype: int64

In [798]:
principals_history[principals_history['genres'].str.startswith('\\N')]

Unnamed: 0,tconst,titleType,primaryTitle,startYear,runtimeMinutes,genres,newurl,mpaarating,rlsdt,budget,wordlwide,averageRating,numVotes
572,tt2546296,movie,The 2006 Academy Award Nominated Short Films: ...,2007,45,\N,https://www.imdb.com/title/tt2546296,error,"February 16, 2007 (United States)",error,error,5.9,29.0
1746,tt0102098,movie,Im Kreise der Lieben,1991,79,\N,https://www.imdb.com/title/tt0102098,error,"December 12, 1991 (Germany)",error,error,6.9,18.0
2107,tt2776272,movie,Cigars: The Heart and Soul of Cuba,2011,51,\N,https://www.imdb.com/title/tt2776272,Not Rated,December 2011 (United States),error,error,7.4,7.0
2329,tt0219611,movie,Cosmo's Tale,1998,\N,\N,https://www.imdb.com/title/tt0219611,error,error,error,error,6.9,16.0
2888,tt0022049,movie,"The Land of Oz, a Sequel to the 'Wizard of Oz'",1932,\N,\N,https://www.imdb.com/title/tt0022049,error,February 1932 (United States),error,error,6.8,21.0
3025,tt0127905,movie,Store forventninger,1922,\N,\N,https://www.imdb.com/title/tt0127905,error,"August 28, 1922 (Denmark)",error,error,6.1,12.0
3111,tt0070025,movie,The Devil's Elixirs,1973,96,\N,https://www.imdb.com/title/tt0070025,error,"April 6, 1973 (Czechoslovakia)",error,error,6.0,23.0
3126,tt1686881,movie,Juwelen,1930,\N,\N,https://www.imdb.com/title/tt1686881,error,"May 30, 1930 (Austria)",error,error,6.4,13.0
3202,tt0259478,movie,Oru Thayin Sabhatham,1987,\N,\N,https://www.imdb.com/title/tt0259478,error,error,error,error,5.6,13.0
3223,tt1369559,movie,Raja Nanna Raja,1976,133,\N,https://www.imdb.com/title/tt1369559,Not Rated,error,error,error,8.7,48.0


In [799]:
principals_history.loc[(principals_history['genres'].str.startswith('\\N')), 'genres'] = 'unknown'

In [800]:
principals_history['genres'].isnull().sum(), principals_history['genres'].str.startswith('\\N').sum()

(0, 0)
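The vectorization mentioned at the top of this section could take several forms. One simple option, shown here only as a sketch and not necessarily what the later notebooks do, is to split the comma-separated genre string into one 0/1 indicator column per genre with `Series.str.get_dummies`. The `genres_demo` values are illustrative.

```python
import pandas as pd

# Illustrative genre strings in the same comma-separated layout as the data
genres_demo = pd.Series(["Comedy", "Comedy,Drama,Romance", "unknown"], name="genres")

# One indicator column per distinct genre; 'unknown' becomes its own column
genre_flags = genres_demo.str.get_dummies(sep=",")
print(genre_flags)
```

Keeping "unknown" as its own column means titles without genre data stay in the frame without being mistaken for any real genre.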

## 07.05 Run Time

In [801]:
principals_history[principals_history['runtimeMinutes'].str.startswith('\\N')]

Unnamed: 0,tconst,titleType,primaryTitle,startYear,runtimeMinutes,genres,newurl,mpaarating,rlsdt,budget,wordlwide,averageRating,numVotes
223,tt0429478,movie,3-D Rocks,2005,\N,Documentary,https://www.imdb.com/title/tt0429478,error,error,error,error,7.4,27.0
407,tt1954412,movie,Someone Like Me,2012,\N,Drama,https://www.imdb.com/title/tt1954412,error,"March 1, 2012 (Switzerland)",error,error,7.1,144.0
1072,tt2710534,movie,Arrive Alive,1990,\N,"Action,Comedy,Mystery",https://www.imdb.com/title/tt2710534,error,1990 (United States),error,error,5.8,11.0
1239,tt0300227,movie,Het mysterie van het lam,1987,\N,Animation,https://www.imdb.com/title/tt0300227,error,error,error,error,5.4,8.0
1261,tt0499608,movie,Tranced,2010,\N,"Drama,Fantasy,Romance",https://www.imdb.com/title/tt0499608,error,2010 (United States),"$4,000,000 (estimated)",error,5.4,49.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7675,tt0427626,movie,Ang panday,1980,\N,"Action,Fantasy,Horror",https://www.imdb.com/title/tt0427626,error,"December 25, 1980 (Philippines)",error,error,4.5,82.0
7676,tt0428937,movie,Ang panday IV,1984,\N,"Action,Adventure,Fantasy",https://www.imdb.com/title/tt0428937,error,"December 25, 1984 (Philippines)",error,error,5.4,14.0
7677,tt0436659,movie,Pagbabalik ng panday,1981,\N,"Action,Adventure,Fantasy",https://www.imdb.com/title/tt0436659,error,"December 25, 1981 (Philippines)",error,error,4.7,21.0
7702,tt1305840,movie,Aloo Chaat,2009,\N,"Comedy,Drama,Romance",https://www.imdb.com/title/tt1305840,error,"March 20, 2009 (India)",error,"$1,213,516",5.9,932.0


There remain 157 films without runtimes. After spot-checking a few, some were made, some were never released, and some were released but not in the States. Some border on pornography. We may or may not use movies of similar runtimes, so we're just going to leave the data in here for now.
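If runtime is used later, the column will need to be numeric. A minimal sketch, assuming the `\N` marker should simply become a missing value, is to let `pd.to_numeric` coerce anything unparseable to NaN. The `runtime_raw` values are illustrative.

```python
import pandas as pd

# Illustrative runtimes; r"\N" is the missing-value marker seen in the data
runtime_raw = pd.Series(["99", "127", r"\N"])

# Unparseable entries become NaN instead of raising an error
runtime_min = pd.to_numeric(runtime_raw, errors="coerce")
print(runtime_min)
```

This keeps the rows in place, consistent with the decision above to leave the data in for now.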


In [805]:
principals_history.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6003 entries, 0 to 7756
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          6003 non-null   object 
 1   titleType       6003 non-null   object 
 2   primaryTitle    6003 non-null   object 
 3   startYear       6003 non-null   object 
 4   runtimeMinutes  6003 non-null   object 
 5   genres          6003 non-null   object 
 6   newurl          6003 non-null   object 
 7   mpaarating      6003 non-null   object 
 8   rlsdt           6003 non-null   object 
 9   budget          6003 non-null   object 
 10  wordlwide       6003 non-null   object 
 11  averageRating   6003 non-null   float64
 12  numVotes        6003 non-null   float64
dtypes: float64(2), object(11)
memory usage: 656.6+ KB


## 07.06 Budget and Worldwide Revenue


The budget and wordlwide columns should be numeric, but pandas reads them as object. Let's figure out what's going on.

In [803]:
principals_history['budget'].isnull().sum(), principals_history['budget'].str.startswith('\\N').sum(), principals_history['budget'].str.contains('error').sum()

(0, 0, 3835)

In [804]:
principals_history['wordlwide'].isnull().sum(), principals_history['wordlwide'].str.startswith('\\N').sum(), principals_history['wordlwide'].str.contains('error').sum()

(0, 0, 3813)

In [807]:
principals_history['wordlwide'].value_counts()

error           3813
$469,571           1
$186,797,986       1
$369,884,651       1
$17,492,014        1
                ... 
$51,024,243        1
$69,319,426        1
$76,347,426        1
$44,095,996        1
$62,856,743        1
Name: wordlwide, Length: 2191, dtype: int64

Aside from the dollar signs and thousands separators, 'error' appears to be the only value preventing 'wordlwide' from being recognized as a number.

In [808]:
principals_history['wordlwide'].head(60)

0           $469,571
1           $563,162
2         $2,000,853
3       $449,326,618
5        $46,922,870
6       $542,358,331
7              error
8           $141,143
9         $1,191,021
10             error
11       $78,371,200
12      $131,060,248
13       $90,000,098
14      $520,881,154
15      $378,882,411
16    $2,201,647,264
17       $27,570,076
18    $2,847,379,794
25       $21,095,638
26       $69,821,334
27       $85,313,124
28       $36,611,610
29       $57,269,863
30      $126,297,830
31      $152,368,585
32      $137,783,840
33       $48,424,341
34      $355,237,933
35      $309,492,681
36       $35,242,897
37      $345,823,032
38      $316,791,257
39       $38,364,277
40      $108,539,911
41      $760,006,945
42      $485,930,816
43       $27,426,335
48       $93,920,758
49       $69,721,966
51       $96,983,009
52             error
53       $12,283,966
54      $220,021,259
55      $392,924,807
58             error
59        $8,083,942
61            $3,414
62           

In [809]:
principals_history['wordlwide'].tail(60)

7597    $414,351,546
7603    $410,902,662
7606    $125,897,478
7609     $28,748,076
7611           error
7612           error
7613           error
7614           error
7615           error
7616           error
7617           error
7618           error
7619           error
7621     $11,229,399
7622           error
7634    $433,477,601
7638    $235,956,898
7641     $96,942,115
7642     $95,017,038
7650        $291,742
7653         $68,643
7655           error
7657           error
7665        $499,649
7666         $17,722
7667           error
7668           error
7669      $2,408,629
7670           error
7672           error
7673           error
7674           error
7675           error
7676           error
7677           error
7679           error
7685           error
7692     $64,626,786
7693    $400,671,789
7695    $303,144,152
7700    $409,231,607
7702      $1,213,516
7703     $29,584,292
7704           error
7705    $214,215,889
7707        $114,603
7713     $59,945,012
7715         

Similarly, it looks like 'error' and the trailing '(estimated)' are preventing 'budget' from being recognized as a number.

In [810]:
principals_history['budget'].head(60)

0                        error
1                        error
2      $27,000,000 (estimated)
3     $150,000,000 (estimated)
5                        error
6      $95,000,000 (estimated)
7                        error
8                        error
9                        error
10        $145,786 (estimated)
11      $6,400,000 (estimated)
12     $18,500,000 (estimated)
13     $70,000,000 (estimated)
14    $102,000,000 (estimated)
15    $115,000,000 (estimated)
16    $200,000,000 (estimated)
17     $13,000,000 (estimated)
18    $237,000,000 (estimated)
25      $8,100,000 (estimated)
26     $11,000,000 (estimated)
27     $17,500,000 (estimated)
28     $18,000,000 (estimated)
29     $35,000,000 (estimated)
30     $20,000,000 (estimated)
31     $40,000,000 (estimated)
32     $60,000,000 (estimated)
33      $6,000,000 (estimated)
34     $52,000,000 (estimated)
35     $80,000,000 (estimated)
36     $80,000,000 (estimated)
37    $123,000,000 (estimated)
38     $58,000,000 (estimated)
39     $

In [811]:
principals_history['budget'].tail(60)

7597     $81,000,000 (estimated)
7603    $104,000,000 (estimated)
7606                       error
7609     €22,000,000 (estimated)
7611                       error
7612                       error
7613                       error
7614                       error
7615                       error
7616                       error
7617                       error
7618                       error
7619                       error
7621                       error
7622                       error
7634    $150,000,000 (estimated)
7638     $25,000,000 (estimated)
7641     $20,000,000 (estimated)
7642     $21,000,000 (estimated)
7650      $7,000,000 (estimated)
7653                       error
7655                       error
7657                       error
7665                       error
7666                       error
7667                       error
7668                       error
7669                       error
7670                       error
7672                       error
7673      

If we removed the '(estimated)' suffix and wrote the formula so that it skipped 'error', we could still get ROI for some 2,000 films.
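One way to sketch that idea is a single helper that pulls out the digits of dollar amounts and leaves everything else missing, so that 'error' is skipped automatically in later arithmetic. The helper name `parse_usd` and the sample values are hypothetical, and treating non-dollar amounts as missing is an assumption, not something the notebook does.

```python
import pandas as pd

def parse_usd(series):
    """Return dollar amounts as floats; 'error' and other currencies become NaN."""
    # Capture the digits and commas of strings that start with '$'
    digits = series.str.extract(r"^\$([\d,]+)", expand=False)
    # Drop the thousands separators and convert; non-matches stay NaN
    return pd.to_numeric(digits.str.replace(",", "", regex=False), errors="coerce")

budget_demo = pd.Series(["error", "$27,000,000 (estimated)", "€22,000,000 (estimated)"])
budget_usd = parse_usd(budget_demo)
print(budget_usd)
```

Because the regular expression stops at the first character that is not a digit or comma, the '(estimated)' suffix does not need to be removed separately.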

In [812]:
principals_history.head()

Unnamed: 0,tconst,titleType,primaryTitle,startYear,runtimeMinutes,genres,newurl,mpaarating,rlsdt,budget,wordlwide,averageRating,numVotes
0,tt0113403,movie,A Midwinter's Tale,1995,99,Comedy,https://www.imdb.com/title/tt0113403,R,"February 16, 1996 (United States)",error,"$469,571",7.2,2577.0
1,tt0450972,movie,As You Like It,2006,127,"Comedy,Drama,Romance",https://www.imdb.com/title/tt0450972,PG,"September 21, 2007 (United Kingdom)",error,"$563,162",6.1,3354.0
2,tt0475331,movie,The Magic Flute,2006,135,"Drama,Musical,Romance",https://www.imdb.com/title/tt0475331,error,"December 13, 2006 (France)","$27,000,000 (estimated)","$2,000,853",6.5,1236.0
3,tt0800369,movie,Thor,2011,115,"Action,Adventure,Fantasy",https://www.imdb.com/title/tt0800369,PG-13,"May 6, 2011 (United States)","$150,000,000 (estimated)","$449,326,618",7.0,810857.0
5,tt12789558,movie,Belfast,2021,98,"Biography,Drama,History",https://www.imdb.com/title/tt12789558,PG-13,"November 12, 2021 (United States)",error,"$46,922,870",7.3,44239.0


In [813]:
principals_history['budget'].head(50)

0                        error
1                        error
2      $27,000,000 (estimated)
3     $150,000,000 (estimated)
5                        error
6      $95,000,000 (estimated)
7                        error
8                        error
9                        error
10        $145,786 (estimated)
11      $6,400,000 (estimated)
12     $18,500,000 (estimated)
13     $70,000,000 (estimated)
14    $102,000,000 (estimated)
15    $115,000,000 (estimated)
16    $200,000,000 (estimated)
17     $13,000,000 (estimated)
18    $237,000,000 (estimated)
25      $8,100,000 (estimated)
26     $11,000,000 (estimated)
27     $17,500,000 (estimated)
28     $18,000,000 (estimated)
29     $35,000,000 (estimated)
30     $20,000,000 (estimated)
31     $40,000,000 (estimated)
32     $60,000,000 (estimated)
33      $6,000,000 (estimated)
34     $52,000,000 (estimated)
35     $80,000,000 (estimated)
36     $80,000,000 (estimated)
37    $123,000,000 (estimated)
38     $58,000,000 (estimated)
39     $

### 07.06.01 Clean Up Budget Data

In [814]:
# Remove 'estimated'
principals_history['budget_adj'] = principals_history['budget'].apply(lambda x: x[:-12] if "estimated" in x else x)
principals_history['budget_adj'].head(50)

0            error
1            error
2      $27,000,000
3     $150,000,000
5            error
6      $95,000,000
7            error
8            error
9            error
10        $145,786
11      $6,400,000
12     $18,500,000
13     $70,000,000
14    $102,000,000
15    $115,000,000
16    $200,000,000
17     $13,000,000
18    $237,000,000
25      $8,100,000
26     $11,000,000
27     $17,500,000
28     $18,000,000
29     $35,000,000
30     $20,000,000
31     $40,000,000
32     $60,000,000
33      $6,000,000
34     $52,000,000
35     $80,000,000
36     $80,000,000
37    $123,000,000
38     $58,000,000
39     $60,000,000
40     $88,000,000
41    $125,000,000
42    $150,000,000
43     $25,000,000
48    $100,000,000
49     $70,000,000
51     $38,000,000
52           error
53           error
54     $75,000,000
55    $275,000,000
58     $45,000,000
59           error
61           error
62         $10,000
63      $5,000,000
64     $40,000,000
Name: budget_adj, dtype: object

In [815]:
# Remove commas, then strip the leading currency symbol from everything except 'error'
principals_history['budget_adj'] = principals_history['budget_adj'].apply(lambda x: x.replace(",",""))
principals_history['budget_adj'] = principals_history['budget_adj'].apply(lambda x: x if "error" in x else x[1:])
principals_history['budget_adj'].head()

0        error
1        error
2     27000000
3    150000000
5        error
Name: budget_adj, dtype: object

In [816]:
# Replace 'error' with 0
principals_history['budget_adj'] = principals_history['budget_adj'].apply(lambda x: 0 if "error" in x else x)
principals_history['budget_adj'].head()

0            0
1            0
2     27000000
3    150000000
5            0
Name: budget_adj, dtype: object

In [818]:
# Convert budget to integers; anything that cannot be parsed becomes 0
def convert(val):
    try:
        return int(val)
    except (TypeError, ValueError):
        return 0


principals_history['budget_adj'] = principals_history['budget_adj'].apply(convert)

In [819]:
principals_history['budget_adj']

0               0
1               0
2        27000000
3       150000000
5               0
          ...    
7745     16937665
7750      8000000
7751      4900000
7754      8000000
7756     20000000
Name: budget_adj, Length: 6003, dtype: int64
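One caveat with this cleanup: the budget tail shown earlier includes a euro amount (€22,000,000), and stripping the first character stores it as if it were dollars. A quick way to see how many budgets are affected is to count the currency markers before cleaning. The sketch below uses illustrative values in `budget_demo`; on the real data the same two lines would run against `principals_history['budget']`.

```python
import pandas as pd

# Illustrative raw budget strings, including one non-dollar amount
budget_demo = pd.Series([
    "error",
    "$27,000,000 (estimated)",
    "€22,000,000 (estimated)",
    "$145,786 (estimated)",
])

# Everything before the first digit is the currency marker;
# 'error' contains no digit, so it yields NaN and is not counted
prefix = budget_demo.str.extract(r"^(\D+)\d", expand=False)
prefix_counts = prefix.value_counts()
print(prefix_counts)
```

If the non-dollar count turns out to be small, dropping those budgets or converting them by hand may be simpler than building in exchange rates.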

### 07.06.02 Clean Up Worldwide Revenue Data

In [824]:
# Perform the same cleanup on worldwide revenue: remove commas, strip the leading '$', map 'error' to 0, then convert
principals_history['worldwide_adj'] = principals_history['wordlwide']
principals_history['worldwide_adj'] = principals_history['worldwide_adj'].apply(lambda x: x.replace(",",""))
principals_history['worldwide_adj'] = principals_history['worldwide_adj'].apply(lambda x: x if "error" in x else x[1:])
principals_history['worldwide_adj'] = principals_history['worldwide_adj'].apply(lambda x: 0 if "error" in x else x)

def convert(val):
    # Fall back to 0 for anything that cannot be parsed as an integer
    try:
        return int(val)
    except (TypeError, ValueError):
        return 0


principals_history['worldwide_adj'] = principals_history['worldwide_adj'].apply(convert)
principals_history['worldwide_adj']

0          469571
1          563162
2         2000853
3       449326618
5        46922870
          ...    
7745            0
7750      9074749
7751     10421847
7754      1561698
7756     62856743
Name: worldwide_adj, Length: 6003, dtype: int64

In [825]:
principals_history.head()

Unnamed: 0,tconst,titleType,primaryTitle,startYear,runtimeMinutes,genres,newurl,mpaarating,rlsdt,budget,wordlwide,averageRating,numVotes,budget_adj,worldwide_adj
0,tt0113403,movie,A Midwinter's Tale,1995,99,Comedy,https://www.imdb.com/title/tt0113403,R,"February 16, 1996 (United States)",error,"$469,571",7.2,2577.0,0,469571
1,tt0450972,movie,As You Like It,2006,127,"Comedy,Drama,Romance",https://www.imdb.com/title/tt0450972,PG,"September 21, 2007 (United Kingdom)",error,"$563,162",6.1,3354.0,0,563162
2,tt0475331,movie,The Magic Flute,2006,135,"Drama,Musical,Romance",https://www.imdb.com/title/tt0475331,error,"December 13, 2006 (France)","$27,000,000 (estimated)","$2,000,853",6.5,1236.0,27000000,2000853
3,tt0800369,movie,Thor,2011,115,"Action,Adventure,Fantasy",https://www.imdb.com/title/tt0800369,PG-13,"May 6, 2011 (United States)","$150,000,000 (estimated)","$449,326,618",7.0,810857.0,150000000,449326618
5,tt12789558,movie,Belfast,2021,98,"Biography,Drama,History",https://www.imdb.com/title/tt12789558,PG-13,"November 12, 2021 (United States)",error,"$46,922,870",7.3,44239.0,0,46922870


## 07.07 Create ROI Calculation


In [831]:
principals_history['ROI'] = (principals_history.worldwide_adj - principals_history.budget_adj ) / principals_history.budget_adj

In [832]:
principals_history.head()

Unnamed: 0,tconst,titleType,primaryTitle,startYear,runtimeMinutes,genres,newurl,mpaarating,rlsdt,budget,wordlwide,averageRating,numVotes,budget_adj,worldwide_adj,ROI
0,tt0113403,movie,A Midwinter's Tale,1995,99,Comedy,https://www.imdb.com/title/tt0113403,R,"February 16, 1996 (United States)",error,"$469,571",7.2,2577.0,0,469571,inf
1,tt0450972,movie,As You Like It,2006,127,"Comedy,Drama,Romance",https://www.imdb.com/title/tt0450972,PG,"September 21, 2007 (United Kingdom)",error,"$563,162",6.1,3354.0,0,563162,inf
2,tt0475331,movie,The Magic Flute,2006,135,"Drama,Musical,Romance",https://www.imdb.com/title/tt0475331,error,"December 13, 2006 (France)","$27,000,000 (estimated)","$2,000,853",6.5,1236.0,27000000,2000853,-0.925894
3,tt0800369,movie,Thor,2011,115,"Action,Adventure,Fantasy",https://www.imdb.com/title/tt0800369,PG-13,"May 6, 2011 (United States)","$150,000,000 (estimated)","$449,326,618",7.0,810857.0,150000000,449326618,1.995511
5,tt12789558,movie,Belfast,2021,98,"Biography,Drama,History",https://www.imdb.com/title/tt12789558,PG-13,"November 12, 2021 (United States)",error,"$46,922,870",7.3,44239.0,0,46922870,inf
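The `inf` values above come from dividing by a budget of 0, which in this data means "budget unknown" rather than a free production. A minimal sketch of one alternative, assuming a 0 in either adjusted column should be treated as missing, is to mask the zeros before dividing so that ROI is NaN wherever an input is unknown. The `roi_demo` frame reuses three budget and revenue pairs from the rows shown above.

```python
import numpy as np
import pandas as pd

# Budget and revenue pairs taken from the first rows displayed above
roi_demo = pd.DataFrame({
    "budget_adj": [0, 27000000, 150000000],
    "worldwide_adj": [469571, 2000853, 449326618],
})

# Treat 0 as "unknown" in both columns so it cannot produce inf or a misleading -1
budget = roi_demo["budget_adj"].replace(0, np.nan)
revenue = roi_demo["worldwide_adj"].replace(0, np.nan)
roi_demo["ROI"] = (revenue - budget) / budget
print(roi_demo)
```

NaN values are skipped by default in pandas aggregations such as `mean`, whereas a single `inf` would dominate any average of ROI.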


## 07.08 Release Date


In [868]:
# Strip the trailing "(country)" so that only the date text remains
principals_history['rlsdt_dt'] = principals_history['rlsdt'].str.replace(r"\(.*\)", "", regex=True)

# Parse the date, then derive month, day of month, and weekday name; unparseable values such as 'error' become NaT
principals_history['rlsdt_dt'] = pd.to_datetime(principals_history['rlsdt_dt'], errors='coerce')
principals_history['rlsdt_mo'] = principals_history['rlsdt_dt'].dt.month
principals_history['rlsdt_day'] = principals_history['rlsdt_dt'].dt.day
principals_history['rlsdt_daynm'] = principals_history['rlsdt_dt'].dt.day_name()





In [872]:
principals_history.head(5)

Unnamed: 0,tconst,titleType,primaryTitle,startYear,runtimeMinutes,genres,newurl,mpaarating,rlsdt,budget,wordlwide,averageRating,numVotes,budget_adj,worldwide_adj,ROI,rlsdt_dt,rlsdt_mo,rlsdt_day,rlsdt_daynm
0,tt0113403,movie,A Midwinter's Tale,1995,99,Comedy,https://www.imdb.com/title/tt0113403,R,"February 16, 1996 (United States)",error,"$469,571",7.2,2577.0,0,469571,inf,1996-02-16,2.0,16.0,Friday
1,tt0450972,movie,As You Like It,2006,127,"Comedy,Drama,Romance",https://www.imdb.com/title/tt0450972,PG,"September 21, 2007 (United Kingdom)",error,"$563,162",6.1,3354.0,0,563162,inf,2007-09-21,9.0,21.0,Friday
2,tt0475331,movie,The Magic Flute,2006,135,"Drama,Musical,Romance",https://www.imdb.com/title/tt0475331,error,"December 13, 2006 (France)","$27,000,000 (estimated)","$2,000,853",6.5,1236.0,27000000,2000853,-0.925894,2006-12-13,12.0,13.0,Wednesday
3,tt0800369,movie,Thor,2011,115,"Action,Adventure,Fantasy",https://www.imdb.com/title/tt0800369,PG-13,"May 6, 2011 (United States)","$150,000,000 (estimated)","$449,326,618",7.0,810857.0,150000000,449326618,1.995511,2011-05-06,5.0,6.0,Friday
5,tt12789558,movie,Belfast,2021,98,"Biography,Drama,History",https://www.imdb.com/title/tt12789558,PG-13,"November 12, 2021 (United States)",error,"$46,922,870",7.3,44239.0,0,46922870,inf,2021-11-12,11.0,12.0,Friday
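Not every release string carries a full date: the rows shown earlier include values such as "December 2011 (United States)" and "1990 (United States)". For those, a day of the month or weekday derived from the parsed date would not be meaningful. The sketch below, which is an assumption about how one might flag this and not part of the original notebook, labels each string by its precision using regular expressions only. The `rls_demo` values are illustrative.

```python
import numpy as np
import pandas as pd

rls_demo = pd.Series([
    "February 16, 1996 (United States)",
    "December 2011 (United States)",
    "1990 (United States)",
    "error",
])

# Strip the trailing "(country)" and any whitespace before it
rls_clean = rls_demo.str.replace(r"\s*\(.*\)$", "", regex=True)

# Classify by shape: "Month D, YYYY", "Month YYYY", or "YYYY"
is_full = rls_clean.str.fullmatch(r"[A-Za-z]+ \d{1,2}, \d{4}").to_numpy(dtype=bool)
is_month = rls_clean.str.fullmatch(r"[A-Za-z]+ \d{4}").to_numpy(dtype=bool)
is_year = rls_clean.str.fullmatch(r"\d{4}").to_numpy(dtype=bool)

precision = np.select([is_full, is_month, is_year], ["day", "month", "year"], default="missing")
print(list(precision))
```

A column like this could be used to keep the weekday and day-of-month features only where the precision is "day".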


In [874]:
principals_history.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6003 entries, 0 to 7756
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   tconst          6003 non-null   object        
 1   titleType       6003 non-null   object        
 2   primaryTitle    6003 non-null   object        
 3   startYear       6003 non-null   object        
 4   runtimeMinutes  6003 non-null   object        
 5   genres          6003 non-null   object        
 6   newurl          6003 non-null   object        
 7   mpaarating      6003 non-null   object        
 8   rlsdt           6003 non-null   object        
 9   budget          6003 non-null   object        
 10  wordlwide       6003 non-null   object        
 11  averageRating   6003 non-null   float64       
 12  numVotes        6003 non-null   float64       
 13  budget_adj      6003 non-null   int64         
 14  worldwide_adj   6003 non-null   int64         
 15  ROI 

Now that we have worldwide revenue and ROI for roughly 2,000 of the films in the history, we need to summarize them as best we can and combine them with the Disney Feature Film data so that we can start building our models. Runtime and the month, day of the month, and weekday of release are also available as potential features.

## 07.09 Export Principal History


In [873]:
principals_history.to_csv('../Bens_Data/principals_history_updated.csv')
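Writing the CSV with the default settings also writes the index, which comes back as an `Unnamed: 0` column on the next read (the same column dropped right after the import at the top of this notebook). A minimal sketch of avoiding that, using an in-memory buffer and an illustrative `export_demo` frame in place of the real file, is to pass `index=False`.

```python
import io
import pandas as pd

# Illustrative frame with a non-contiguous index, as left behind by the row drops
export_demo = pd.DataFrame(
    {"tconst": ["tt0113403", "tt12789558"], "averageRating": [7.2, 7.3]},
    index=[0, 5],
)

# Write without the index, then read back from the same buffer
buffer = io.StringIO()
export_demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```

If the original row labels are needed downstream, the index can be kept by leaving `index=True` and reading the file back with `index_col=0`.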