# Watch Me Code 2: Superhero Movies

https://raw.githubusercontent.com/mafudge/datasets/master/superhero/superhero-movie-dataset-1978-2012.csv

Columns: year, title, comic, imdb, rt, composite, opening_weeked_bo, avg_ticket_price, opening_weekend_attend, us_pop_that_year

- read_csv file from web
- no column names
- head(), sample()
- value_counts
- dealing with nulls
- Feature engineering


In [12]:
import pandas as pd
file = 'https://raw.githubusercontent.com/mafudge/datasets/master/superhero/superhero-movie-dataset-1978-2012.csv'
sh = pd.read_csv(file, header=None)
sh.sample(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
37,2009,Watchmen,DC,7.7,64,70.5,55214334.0,7.5,7361911.0,307006550
5,1986,Howard the Duck,Marvel,4.3,16,29.5,5070136.0,3.71,1366613.0,240132887
8,1989,The Return of Swamp Thing,DC,3.9,40,39.5,,3.97,,246819230
34,2008,The Incredible Hulk,Marvel,7.0,67,68.5,55414050.0,7.18,7717834.0,304374846
10,1992,Batman Returns,DC,7.0,78,74.0,45687711.0,4.15,11009090.0,255029699


In [13]:
# no columns? no sweat!
sh.columns = [ 'year', 'title', 'comic', 'imdb', 'rt', 'composite', 'opening_weeked_bo', 'avg_ticket_price', 'opening_weekend_attend', 'us_pop_that_year']
sh.head()

Unnamed: 0,year,title,comic,imdb,rt,composite,opening_weeked_bo,avg_ticket_price,opening_weekend_attend,us_pop_that_year
0,1978,Superman,DC,7.3,95,84.0,7465343.0,2.34,3190317.521,222584545
1,1980,Superman II,DC,6.7,88,77.5,14100523.0,2.69,5241830.112,227224681
2,1982,Swamp Thing,DC,5.3,60,56.5,,2.94,,231664458
3,1983,Superman III,DC,4.9,24,36.5,13352357.0,3.15,4238843.492,233791994
4,1984,Supergirl,DC,4.2,8,25.0,5738249.0,3.36,1707812.202,235824902


Once you have a data set loaded, you must understand the characteristics of the data.
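
A minimal sketch of a first look at the data, assuming a small stand-in frame named `demo` (a hypothetical name) built from the five `head()` rows shown above so that it runs without a network connection. The same calls should work on the full `sh` DataFrame.

```python
import numpy as np
import pandas as pd

# Five rows copied from the head() output above; `demo` is a stand-in for `sh`.
cols = ['year', 'title', 'comic', 'imdb', 'rt', 'composite', 'opening_weeked_bo',
        'avg_ticket_price', 'opening_weekend_attend', 'us_pop_that_year']
rows = [
    [1978, 'Superman', 'DC', 7.3, 95, 84.0, 7465343.0, 2.34, 3190317.521, 222584545],
    [1980, 'Superman II', 'DC', 6.7, 88, 77.5, 14100523.0, 2.69, 5241830.112, 227224681],
    [1982, 'Swamp Thing', 'DC', 5.3, 60, 56.5, np.nan, 2.94, np.nan, 231664458],
    [1983, 'Superman III', 'DC', 4.9, 24, 36.5, 13352357.0, 3.15, 4238843.492, 233791994],
    [1984, 'Supergirl', 'DC', 4.2, 8, 25.0, 5738249.0, 3.36, 1707812.202, 235824902],
]
demo = pd.DataFrame(rows, columns=cols)

print(demo.shape)                 # (number of rows, number of columns)
print(demo.dtypes)                # data type of each column
null_counts = demo.isnull().sum() # missing values per column
print(null_counts)
print(demo.describe())            # summary statistics for the numeric columns
```

The null counts are worth checking early, since the missing box office and attendance figures matter for the later steps.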

In [15]:
## Who has more movies? DC or Marvel? use value_counts
sh['comic'].value_counts()

Marvel    29
DC        20
Name: comic, dtype: int64

In [16]:
## let's see that as a fraction of the total: normalize=True
sh['comic'].value_counts(normalize=True)

Marvel    0.591837
DC        0.408163
Name: comic, dtype: float64

In [18]:
## what are the ratios in the last 10 years of data ?
sh[ sh['year'] > 2002 ]['comic'].value_counts(normalize=True) 

Marvel    0.741935
DC        0.258065
Name: comic, dtype: float64

In [19]:
# what about the earlier years of data? 1978 - 2002
sh[ sh['year'] <= 2002 ]['comic'].value_counts(normalize=True)

DC        0.666667
Marvel    0.333333
Name: comic, dtype: float64
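
To look at an explicit window of years such as 1978 to 1987, one option is `Series.between`, which is inclusive on both ends by default. The sketch below assumes a stand-in frame named `demo` (hypothetical) holding only the year and publisher of rows that are visible in the outputs of this notebook, so its ratios describe those rows only.

```python
import pandas as pd

# Year and publisher for rows visible in this notebook's outputs (indexes 0-7 and 10).
demo = pd.DataFrame({
    'year':  [1978, 1980, 1982, 1983, 1984, 1986, 1987, 1989, 1992],
    'comic': ['DC', 'DC', 'DC', 'DC', 'DC', 'Marvel', 'DC', 'DC', 'DC'],
})

# Keep only the rows whose year falls in 1978..1987 (both ends included).
early = demo[demo['year'].between(1978, 1987)]

# Share of each publisher within that window.
early_share = early['comic'].value_counts(normalize=True)
print(early_share)
```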

Let's find the most popular opening weekend. We cannot compare raw sales or attendance, because ticket prices and the US population change from year to year. Instead, we will divide opening weekend attendance by the US population for that year to get the fraction of the US population who saw the movie.

In [21]:
# nulls: drop the rows that are missing box office or attendance figures
sh = sh.dropna()
sh.head(10)

Unnamed: 0,year,title,comic,imdb,rt,composite,opening_weeked_bo,avg_ticket_price,opening_weekend_attend,us_pop_that_year
0,1978,Superman,DC,7.3,95,84.0,7465343.0,2.34,3190318.0,222584545
1,1980,Superman II,DC,6.7,88,77.5,14100523.0,2.69,5241830.0,227224681
3,1983,Superman III,DC,4.9,24,36.5,13352357.0,3.15,4238843.0,233791994
4,1984,Supergirl,DC,4.2,8,25.0,5738249.0,3.36,1707812.0,235824902
5,1986,Howard the Duck,Marvel,4.3,16,29.5,5070136.0,3.71,1366613.0,240132887
6,1987,Superman IV: The Quest for Peace,DC,3.6,10,23.0,5683122.0,3.91,1453484.0,242288918
7,1989,Batman,DC,7.6,71,73.5,40489746.0,3.97,10198930.0,246819230
10,1992,Batman Returns,DC,7.0,78,74.0,45687711.0,4.15,11009090.0,255029699
11,1995,Batman Forever,DC,5.4,42,48.0,52784433.0,4.35,12134350.0,262803276
12,1997,Batman & Robin,DC,3.6,12,24.0,42872605.0,4.59,9340437.0,267783607


In [None]:
## skip nulls in analysis
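
One possible way to skip nulls without dropping rows, sketched on a three-row stand-in frame named `demo` (hypothetical) that uses values from the outputs above. It relies on pandas aggregations ignoring `NaN` by default and on `notna()` for an explicit mask.

```python
import numpy as np
import pandas as pd

# Three rows from the outputs above; Swamp Thing has no attendance figure.
demo = pd.DataFrame({
    'title': ['Superman', 'Swamp Thing', 'Superman III'],
    'opening_weekend_attend': [3190317.521, np.nan, 4238843.492],
    'us_pop_that_year': [222584545, 231664458, 233791994],
})

# Aggregations skip NaN by default (skipna=True), so the mean uses two values.
mean_attend = demo['opening_weekend_attend'].mean()

# Arithmetic propagates NaN: the row with missing attendance yields NaN.
ratio = demo['opening_weekend_attend'] / demo['us_pop_that_year']

# A boolean mask selects rows with known attendance and leaves demo unchanged.
known = demo[demo['opening_weekend_attend'].notna()]

print(mean_attend)
print(ratio)
print(known)
```

Compared with `dropna()`, this keeps the incomplete rows available for analyses that do not need the missing columns, such as the publisher counts.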


In [22]:
# feature engineering - opening weekend attendance as a fraction of the US population
sh['pct_of_pop'] = sh['opening_weekend_attend'] / sh['us_pop_that_year']
sh.sample(5)

Unnamed: 0,year,title,comic,imdb,rt,composite,opening_weeked_bo,avg_ticket_price,opening_weekend_attend,us_pop_that_year,pct_of_pop
17,2002,Spider-Man,Marvel,7.4,89,81.5,114844116.0,5.81,19766630.0,287803914,0.068681
44,2011,X-Men: First Class,Marvel,7.9,87,83.0,55101604.0,7.93,6948500.0,311591917,0.0223
46,2012,The Dark Knight Rises,DC,9.1,86,88.5,160887295.0,7.92,20314050.0,314055984,0.064683
41,2011,Captain America: The First Avenger,Marvel,6.8,79,73.5,65058524.0,7.93,8204101.0,311591917,0.02633
42,2011,Green Lantern,DC,5.9,27,43.0,53174303.0,7.93,6705461.0,311591917,0.02152
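
With the new column in place, the most popular opening weekend can be found by sorting on it in descending order. A minimal sketch, assuming a stand-in frame named `demo` (hypothetical) built from the five sampled rows above; the ranking it prints covers only those five rows, not the full dataset.

```python
import pandas as pd

# The five sampled rows shown above (title, attendance, population).
demo = pd.DataFrame({
    'title': ['Spider-Man', 'X-Men: First Class', 'The Dark Knight Rises',
              'Captain America: The First Avenger', 'Green Lantern'],
    'opening_weekend_attend': [19766630.0, 6948500.0, 20314050.0, 8204101.0, 6705461.0],
    'us_pop_that_year': [287803914, 311591917, 314055984, 311591917, 311591917],
})

# Same feature as in the notebook: attendance as a fraction of the US population.
demo['pct_of_pop'] = demo['opening_weekend_attend'] / demo['us_pop_that_year']

# Largest share first; the first row is the most popular opening in this sample.
ranked = demo.sort_values('pct_of_pop', ascending=False)
top = ranked.iloc[0]

print(ranked[['title', 'pct_of_pop']])
print("Largest share among these five rows:", top['title'])
```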


In [None]:
# Marvel comics with highest opening_weeked_bo
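
This cell was left empty. One way it could be filled in is to filter on `comic` and then use `DataFrame.nlargest`. The sketch assumes a stand-in frame named `demo` (hypothetical) containing only rows that appear in the outputs above, so the result is not the answer for the full dataset.

```python
import pandas as pd

# Rows visible in the sample outputs above (publisher, title, opening weekend box office).
demo = pd.DataFrame({
    'comic': ['Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'DC', 'DC'],
    'title': ['Howard the Duck', 'The Incredible Hulk', 'Spider-Man',
              'X-Men: First Class', 'Captain America: The First Avenger',
              'Watchmen', 'The Dark Knight Rises'],
    'opening_weeked_bo': [5070136.0, 55414050.0, 114844116.0,
                          55101604.0, 65058524.0, 55214334.0, 160887295.0],
})

# Keep the Marvel rows, then take the three largest opening weekends.
marvel = demo[demo['comic'] == 'Marvel']
top3 = marvel.nlargest(3, 'opening_weeked_bo')

print(top3[['title', 'opening_weeked_bo']])
```

Keep in mind the earlier caveat: raw box office is not adjusted for ticket prices, so this ranking favors recent movies.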


In [24]:
## completed data product
import pandas as pd
import IPython.display as display
file = 'https://raw.githubusercontent.com/mafudge/datasets/master/superhero/superhero-movie-dataset-1978-2012.csv'
sh = pd.read_csv(file, header=None)
sh.columns = [ 'year', 'title', 'comic', 'imdb', 'rt', 'composite', 'opening_weeked_bo', 'avg_ticket_price', 'opening_weekend_attend', 'us_pop_that_year']
sh = sh.dropna()
sh['pct_of_pop'] = sh['opening_weekend_attend'] /sh['us_pop_that_year']


pct = float(input("Enter a fraction of the US population (e.g. 0.05 for 5%): "))
print("Superhero movies with opening weekend attendance of at least:", pct)
results = sh[ sh['pct_of_pop'] >= pct ]
if results.empty:
    print("No movies to display")
else:
    display.display(results.sort_values('pct_of_pop', ascending=True))

Enter a fraction of the US population (e.g. 0.05 for 5%): 0.50
Superhero movies with opening weekend attendance of at least: 0.5
No movies to display
