## IMDb Animated Movie Exploratory Data Analysis

Author: **Michael B (MSB46)**

## Objective:

This notebook visualizes each column of the recently cleaned dataframe and records any patterns observed. Further feature engineering is based on those patterns.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from PIL import Image

from datetime import date
import cpi
# cpi.update()

import re
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv("imdb_animated_movies_clean.csv")
df.head()

In [None]:
df.columns

In [None]:
# Select columns with a list; recent pandas versions reject a set as an indexer
df[['rating','genres','production_companies','production_countries','languages','aspect_ratio']].nunique().sort_values(
    ascending=False).plot.bar(figsize=(12,6))

plt.ylabel('Number of unique values')
plt.xlabel('Variables')
plt.tick_params(axis='x', rotation=45)
plt.title('Cardinality')

In [None]:
def check_boxoffice(b, p, w, n, o, r):
    # -1 marks a missing value; TV-rated titles are treated as having no box office run
    if -1 in (b, p, w, n, o) or "TV" in r:
        return 0
    return 1

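The row-wise check above can also be expressed column-wise. The following is a minimal sketch, assuming missing amounts are stored as -1 and ratings are strings; `box_office_flag` and the toy dataframe are hypothetical names and values used only for illustration.

```python
import pandas as pd

def box_office_flag(frame, money_cols, rating_col="rating"):
    """Return 1 where no money column equals -1 and the rating is not a TV rating."""
    has_all_money = frame[money_cols].ne(-1).all(axis=1)
    is_tv = frame[rating_col].str.contains("TV", regex=False)
    return (has_all_money & ~is_tv).astype(int)

# Toy rows (made-up values) that reuse two of the notebook's column names
toy = pd.DataFrame({
    "budget_est_usd": [100, -1, 50],
    "worldwide_gross_usd": [300, 20, 80],
    "rating": ["PG", "G", "TV-PG"],
})
flags = box_office_flag(toy, ["budget_est_usd", "worldwide_gross_usd"])
print(flags.tolist())
```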
In [None]:
df.insert(len(df.columns),'box_office',False)
df['box_office'] = df.apply(lambda x: check_boxoffice(x['budget_est_usd'], x['profit_usd'], 
                                                               x['worldwide_gross_usd'], x['na_gross_usd'], 
                                                               x['opening_weekend_usd'], x['rating']),axis=1)

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
num_df = df.select_dtypes(include=numerics)

In [None]:
num_df.columns

In [None]:
# Reference: https://seaborn.pydata.org/examples/many_pairwise_correlations.html 

sns.set_theme(style="white")

df_validBO = df[df['box_office'] == True]

# Correlate numeric columns only; box_office is constant within this subset
corr = df_validBO.select_dtypes(include=numerics).drop(columns='box_office').corr()

mask = np.triu(np.ones_like(corr, dtype=bool))

f, ax = plt.subplots(figsize=(15, 10))

sns.heatmap(corr, mask=mask, cmap = 'viridis', center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, annot_kws={"fontsize":8})

# sns.heatmap(corr, annot = False, cmap = 'viridis')

# print(df['budget_est_usd'].corr(df['runtime_minutes']))
# print(df['worldwide_gross_usd'].corr(df['runtime_minutes']))
# print(df_validBO['budget_est_usd'].corr(df_validBO['runtime_minutes']))
# print(df_validBO['worldwide_gross_usd'].corr(df_validBO['runtime_minutes']))

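The `mask` passed to the heatmap hides the upper triangle and the diagonal, so each pair of variables is shown only once. A small sketch with a made-up 3x3 correlation matrix (the values are illustrative, not taken from the dataset):

```python
import numpy as np

# Hypothetical symmetric correlation matrix
corr_toy = np.array([[1.0, 0.2, -0.5],
                     [0.2, 1.0, 0.3],
                     [-0.5, 0.3, 1.0]])

# True marks cells to hide: the diagonal and everything above it
mask_toy = np.triu(np.ones_like(corr_toy, dtype=bool))

# Cells that stay visible are the strictly lower triangle, read row by row
visible = corr_toy[~mask_toy]
print(mask_toy)
print(visible)
```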
### Year

In [None]:
df['year'].min()

When it comes to categorizing by era, there is no definitive answer on when each era began and ended. For instance, Wikipedia places the end of the golden era of animation around the late 1960s, while other sources like TVTropes argue that it ended around the late 1950s. The general consensus is that it ended during the second half of the 20th century.

**Golden Era**: The period when theatrical cartoons were popular and many iconic characters rose to stardom. (Steamboat Willie, Merrie Melodies)

**Dark Era**: As television came to rely on cartoons for children's programming in the 60s, animation for these shows became far more limited. That is not necessarily an indicator that overall quality dropped, especially for animated movies of the time. (A Boy Named Charlie Brown, Yogi Bear, The Flintstones, 101 Dalmatians)

**Renaissance Era**: The era many associate with the 'Disney Renaissance', marked by a noticeable return in technical quality. (The Little Mermaid, Aladdin, Akira)

**Millennium Era**: The modern age of animation, relying on new tools, applications and techniques that were uncommon before the 2000s. (Shrek, Tangled, Spider-Man: Into the Spider-Verse)

In [None]:
# Reference for Eras in Animation: https://tvtropes.org/pmwiki/pmwiki.php/Main/HistoryOfAnimation
def year_categorize(x):
    if x < 1960:
        return 'Golden Era'
    elif x >= 1960 and x < 1985:
        return 'Dark Era'
    elif x >= 1985 and x < 2000:
        return 'Renaissance Era'
    else:
        return 'Millennium Era'
    
df['year_period'] = df['year'].map(year_categorize)
temp = df.pop("year_period")
df.insert(3, "year_period", temp)

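The era boundaries used here (1960, 1985, 2000) can also be kept in a small lookup table, which makes them easier to adjust if a different source's dates are preferred. A sketch using the standard library's `bisect`; `ERA_STARTS`, `ERA_NAMES` and `era_for_year` are hypothetical names, and the sample years are arbitrary.

```python
from bisect import bisect_right

# First year of each era after the Golden Era; labels match the notebook's era names
ERA_STARTS = [1960, 1985, 2000]
ERA_NAMES = ['Golden Era', 'Dark Era', 'Renaissance Era', 'Millennium Era']

def era_for_year(year):
    """Return the era whose half-open interval [start, next_start) contains the year."""
    return ERA_NAMES[bisect_right(ERA_STARTS, year)]

sample_years = [1937, 1960, 1984, 1985, 1999, 2000, 2018]
sample_eras = [era_for_year(y) for y in sample_years]
print(list(zip(sample_years, sample_eras)))
```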
In [None]:
df.year_period.value_counts()

In [None]:
order = ['Millennium Era', 'Renaissance Era','Dark Era','Golden Era']
era_order = ['2000 ~', '1985-2000', '1960-1985', '~ 1960']
f, (ax1,ax2) = plt.subplots(1,2,figsize=(15,5))
sns.boxplot(data=df ,x='year_period', y='votescore', ax=ax1, order=order)
ax1.title.set_text('Votescore by Era')
# ax1.tick_params(axis='x', rotation=15)
ax1.set_xticks([0, 1, 2, 3], era_order)

sns.boxplot(data=df[df.metacritic > 0] ,x='year_period', y='metacritic',ax=ax2, order=order)
ax2.title.set_text('Metacritic by Era')
# ax2.tick_params(axis='x', rotation=15)
ax2.set_xticks([0, 1, 2, 3], era_order)


f.savefig('graphs/vote_meta_by_era.png', bbox_inches='tight')
plt.show()

### Runtime

In [None]:
order = sorted(df.rating.unique())

In [None]:
sns.set_theme(palette="muted")

f, (ax1,ax2,ax3) = plt.subplots(1,3, figsize=(15, 4))

sns.boxplot(data=df, y='runtime_minutes', x ='rating', ax=ax1, order=order, color='r')
ax1.title.set_text('Runtime by Rating')
ax1.tick_params(axis='x', rotation=90)

ax2.title.set_text('Runtime of all movies')
sns.boxplot(data=df, x='runtime_minutes', ax=ax2)

ax3.title.set_text('Runtime of all movies with box office info')
sns.boxplot(x=df_validBO['runtime_minutes'], ax=ax3)

### MPAA Rating

In [None]:
# Votescore and Metacritic score within each rating
f, (ax1,ax2) = plt.subplots(1,2,figsize=(15,5))

sns.boxplot(data=df, y='votescore', x='rating',ax=ax1, order=order, color='b')
ax1.title.set_text('Votescore by Rating')
ax1.tick_params(axis='x', rotation=90)

sns.boxplot(data=df[df['metacritic'] > -1], y='metacritic', x='rating',ax=ax2,
            order=sorted(df[df['metacritic'] > -1].rating.unique()), color='g')
ax2.title.set_text('Metacritic by Rating')
ax2.tick_params(axis='x', rotation=90)

f.savefig('graphs/vote_meta_by_rating.png', bbox_inches='tight')
plt.show()

In [None]:
round(pd.pivot_table(df, values=['votescore','metacritic'], index='rating',aggfunc=['mean'],
                    fill_value=0),2)

#### Shortest and longest movie

In [None]:
# idxmin/idxmax return index labels, so look the row up with .loc
shortest_movie = df['runtime_minutes'].idxmin()
df.loc[shortest_movie, ['name','year','budget_est_usd','runtime_minutes','genres']]

In [None]:
longest_movie = df['runtime_minutes'].idxmax()
df.loc[longest_movie, ['name','year','budget_est_usd','runtime_minutes','genres']]

#### Shortest and longest movie in box office

In [None]:
shortest_movie = df_validBO['runtime_minutes'].idxmin()
df.loc[shortest_movie, ['name','year','budget_est_usd','runtime_minutes','genres']]

In [None]:
longest_movie = df_validBO['runtime_minutes'].idxmax()
df.loc[longest_movie, ['name','year','budget_est_usd','runtime_minutes','genres']]

### Genre

In [None]:
df['genres']

In [None]:
# Collect every distinct genre token; the hyphen in the character class keeps "Sci-Fi" whole
words = df['genres'].str.findall(r"[\w.\-]+").sum()
genres = set(words)

In [None]:
print(list(genres))

In [None]:
genre_dict = {}
for x in genres:
    genre_dict[x] = 0
for x in df['genres']:
    for y in x.split(', '):
        genre_dict[y] += 1

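The same tally can be sketched with `collections.Counter`, which removes the need to pre-fill the dictionary with zeros. The sample strings below are made up and only mimic the comma-separated format of the `genres` column; `count_tokens` is a hypothetical helper.

```python
from collections import Counter

def count_tokens(rows, sep=', '):
    """Count how many rows each separated token appears in."""
    counts = Counter()
    for row in rows:
        counts.update(token for token in row.split(sep) if token)
    return counts

# Hypothetical genre strings in the same "A, B, C" format as df['genres']
sample_genres = ["Animation, Comedy, Family", "Animation, Sci-Fi", "Animation, Comedy"]
genre_counts = count_tokens(sample_genres)
print(genre_counts.most_common())
```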
In [None]:
genre_dict

In [None]:
df[['genres','votescore']].head(5)

In [None]:
def get_boxoffice_info_by_keys(df, column, keys, auto_calc=True):
    """Average gross and budget per key (e.g. genre) over films that have box office data."""
    gross_by_col = {}
    budget_by_col = {}
    counter = {}

    for x in range(len(df[column])):
        # Skip films without complete box office information
        if df['box_office'][x] < 1:
            continue
        for k in keys:
            if k in df[column][x].split(', '):
                # Adjust both amounts for inflation from the release year
                gross_by_col[k] = gross_by_col.get(k, 0) + cpi.inflate(df["worldwide_gross_usd"][x], df["year"][x])
                budget_by_col[k] = budget_by_col.get(k, 0) + cpi.inflate(df["budget_est_usd"][x], df["year"][x])
                counter[k] = counter.get(k, 0) + 1

    for k in keys:
        if k not in gross_by_col:
            continue
        if auto_calc:
            # Unadjusted means over every row whose column text contains the key
            matches = df[df[column].str.contains(k, regex=False)]
            gross_by_col[k] = round(matches['worldwide_gross_usd'].mean(), 3)
            budget_by_col[k] = round(matches['budget_est_usd'].mean(), 3)
        else:
            # Inflation-adjusted means over the films counted in the loop above
            gross_by_col[k] = round(gross_by_col[k] / counter[k], 3)
            budget_by_col[k] = round(budget_by_col[k] / counter[k], 3)

    return gross_by_col, budget_by_col, counter


keys = list(genre_dict.keys())
keys.sort()
vals = [genre_dict[k] for k in keys]

genre_avg_gross, genre_avg_budget, cnt = get_boxoffice_info_by_keys(df, 'genres', keys, False)

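# Genres with no box office films fall back to 0 instead of raising a KeyError
avg_g = [genre_avg_gross.get(k, 0) for k in keys]
count_g = [cnt.get(k, 0) for k in keys]
avg_b = [genre_avg_budget.get(k, 0) for k in keys]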

data_bo_genre = {'genre':keys, 'count':vals, 'avg_gross':avg_g, 'avg_budget':avg_b}
df_bo_genre = pd.DataFrame(data=data_bo_genre)

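The per-genre averages are built from inflation-adjusted amounts: each film's value is scaled to present-day dollars, then added to every genre the film lists. The sketch below shows that arithmetic in isolation. It does not call the `cpi` package; `inflate`, `toy_index` and the film values are hypothetical placeholders, and the index levels are not real CPI figures.

```python
def inflate(amount, from_year, to_year, index):
    """Scale an amount by the ratio of price-index levels between two years."""
    return amount * index[to_year] / index[from_year]

# Placeholder index levels for illustration only; NOT real CPI data
toy_index = {1990: 50.0, 2000: 75.0, 2020: 100.0}

# Made-up films with gross in arbitrary units
films = [
    {"genres": "Animation, Comedy", "year": 1990, "gross": 10.0},
    {"genres": "Animation, Drama", "year": 2000, "gross": 30.0},
]

totals, counts = {}, {}
for film in films:
    adjusted = inflate(film["gross"], film["year"], 2020, toy_index)
    # A film contributes its adjusted gross to each of its genres
    for genre in film["genres"].split(", "):
        totals[genre] = totals.get(genre, 0.0) + adjusted
        counts[genre] = counts.get(genre, 0) + 1

avg_adjusted_gross = {g: totals[g] / counts[g] for g in totals}
print(avg_adjusted_gross)
```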
In [None]:
df_bo_genre['avg_gross'] = df_bo_genre['avg_gross'].apply(lambda x: round(x/10_000_000,2))
df_bo_genre['avg_budget'] = df_bo_genre['avg_budget'].apply(lambda x: round(x/10_000_000,2))

df_bo_genre['avg_profit'] = df_bo_genre['avg_gross'] - df_bo_genre['avg_budget']

In [None]:
df_bo_genre

In [None]:
# Reference for Horizontal Graph Labels: https://stackoverflow.com/a/51410758

data_gbo = df_bo_genre.sort_values(["avg_gross"], ascending=True)

# A list keeps the column order fixed so the colors below always map to gross, then budget
ax = data_gbo[['genre','avg_gross','avg_budget']].plot.barh(
    figsize=(15,14), 
    edgecolor = '.75',
    color=['salmon','lightgreen'],
    title='Average Gross and Budgets by Genre',
    x='genre')

plt.gcf().get_axes()
plt.xticks([0, 1, 2, 3, 4, 5], ['0', '10', '20', '30', '40', '50'])

plt.xlabel('USD (Million)')


rects = ax.patches

labels = [f"{x} films" for x in data_gbo['count']]

t = 0
for rect in rects: 
    w = rect.get_width()
    h = rect.get_y() + rect.get_height() / 2

    try:
        plt.annotate(
            labels[t],                
            (w, h),                   
            xytext=(5, 0),            
            textcoords="offset points",
            va='center'
        )             
                                      
    except IndexError:
        continue
    
    t+=1

plt.savefig('graphs/avg_gross_budget_by_genre.png', bbox_inches='tight')
plt.show()

---
### Visualizing Votescore and Metacritic by Genre

In [None]:
def avg_score_by_keys(df, column, keys, auto_calc=True):
    """Average votescore and Metacritic score per key; negative Metacritic values count as missing."""
    score_by_col = {}
    mc_by_col = {}
    counter_a, counter_b = {}, {}

    for x in range(len(df)):
        for k in keys:
            if k in df[column][x].split(', '):
                score_by_col[k] = score_by_col.get(k, 0) + df['votescore'][x]
                counter_a[k] = counter_a.get(k, 0) + 1
                # Only films with a Metacritic score feed the Metacritic totals
                if df['metacritic'][x] >= 0:
                    mc_by_col[k] = mc_by_col.get(k, 0) + df['metacritic'][x]
                    counter_b[k] = counter_b.get(k, 0) + 1

    for k in keys:
        if auto_calc:
            # Substring match on the column text, e.g. "Germany" also covers "West Germany"
            score_by_col[k] = round(df[df[column].str.contains(k, regex=False)]['votescore'].mean(), 3)
        else:
            score_by_col[k] = round(score_by_col[k] / counter_a[k], 3)

        if k in counter_b:
            if auto_calc:
                # Average the Metacritic column of the requested column's matches, skipping missing scores
                has_mc = df['metacritic'] >= 0
                mc_by_col[k] = round(df[has_mc & df[column].str.contains(k, regex=False)]['metacritic'].mean(), 3)
            else:
                mc_by_col[k] = round(mc_by_col[k] / counter_b[k], 3)

    return score_by_col, mc_by_col, counter_b

In [None]:
keys = list(genre_dict.keys())
keys.sort()
vals = [genre_dict[k] for k in keys]

genre_avg_score, genre_avg_mc, count_b = avg_score_by_keys(df, 'genres', keys, False)

avgs_s = [genre_avg_score[k] for k in keys]
count_mc = [count_b[k] for k in keys]
avgs_mc = [genre_avg_mc[k] for k in keys]

In [None]:
data_genre = {'genre':keys, 'count':vals, 'avg_score':avgs_s, 'avg_metacritic':avgs_mc, 'count_mc':count_mc}
df_genre = pd.DataFrame(data=data_genre)

f,ax = plt.subplots(figsize=(15,10))

ax = sns.barplot(data = df_genre.sort_values("count", ascending=False), y="genre", x="count")

plt.xlabel('Count')
plt.ylabel("Genre")
plt.title('Movie Count by Genres')

plt.savefig('graphs/genre_count.png', bbox_inches='tight')
plt.show()

In [None]:
sns.boxplot(data=df, y="votescore")

In [None]:
plt.figure(figsize = (15,8))
sorted_data_genre = df_genre.sort_values("avg_score", ascending=False)

pal = sns.diverging_palette(190, 30, s=70, l=85, n=35)

ax = sns.barplot(data = sorted_data_genre, x="genre", y="avg_score", edgecolor = '.2',palette=pal,dodge=False)
# ax.get_legend().remove()

plt.xlabel('Genre')
plt.ylabel("Average Score")
plt.title('Average Score by Genres')

labels = [f"\n{x}\nfilms" for x in sorted_data_genre['count']]
plt.bar_label(ax.containers[-1],labels, label_type='center')

plt.tick_params(axis='x', rotation=45)
plt.show()

In [None]:
plt.figure(figsize = (15,10))
sorted_data_genre = df_genre.sort_values("avg_metacritic", ascending=False)

pal = sns.diverging_palette(140, 20, l=75, s=90, sep=3, n=30)

ax2 = sns.barplot(data = sorted_data_genre, x="genre", y="avg_metacritic", edgecolor = '.1', color="whitesmoke",palette=pal, dodge=False)
# ax2.get_legend().remove()
plt.xlabel('Genre')
plt.ylabel("Average Metacritic Score")
plt.title('Average Metacritic Score by Genres')

labels = [f"\n{x}\nfilms" for x in sorted_data_genre['count_mc']]
plt.bar_label(ax2.containers[-1], labels = labels, label_type='center')

plt.tick_params(axis='x', rotation=45)
plt.show()

In [None]:
data_g = df_genre.sort_values("avg_score", ascending=False).head(20)

data_g[['genre','avg_score','avg_metacritic']].plot.bar(
    figsize=(15,14), 
    width=0.7,
    color=['darkgray','seagreen'],
    title='Average IMDb and Metacritic Scores by Genre',
    secondary_y= 'avg_metacritic')
ax1, ax2 = plt.gcf().get_axes() # gets the current figure and then the axes

ax1.set_ylim([0, 9])
ax2.set_ylim([0, 90])

ax1.set_ylabel("IMDb User Score")
ax2.set_ylabel("Metacritic Score")

ax1.legend(["Votescore"])
ax2.legend(["Metacritic"], loc="upper left")

x_plot = list(data_g['genre'])
plt.xticks(np.arange(len(data_g)),x_plot)

plt.savefig('graphs/score_rating_by_genre.png', bbox_inches='tight')
plt.show()

In [None]:
# Mean scores per exact genre combination, using only films that have a Metacritic score
genre_misc = df[df['metacritic'] != -1].groupby('genres')[['votescore','metacritic']].mean()

In [None]:
genre_misc.sort_values('votescore',ascending=False)

In [None]:
genre_misc.sort_values('metacritic',ascending=False)

In [None]:
for g in df_genre['genre']:
#     print(g)
    df[f'genre_{g.lower()}'] = df['genres'].apply(lambda x: 1 if g.lower() in x.lower() else 0)
    
    
# Hybrid genres: a film is flagged only when BOTH component genres appear in its genre list.
# Romantic Comedy, Romantic Drama, Dramedy, Action Adventure, Crime Thriller, Horror Thriller,
# Thriller Drama, Thriller Mystery, Family Comedy, Family Adventure
hybrid_genres = {
    'romantic_comedy': ('romance', 'comedy'),
    'romantic_drama': ('romance', 'drama'),
    'dramedy': ('drama', 'comedy'),
    'action_adventure': ('action', 'adventure'),
    'crime_thriller': ('crime', 'thriller'),
    'horror_thriller': ('horror', 'thriller'),
    'thriller_drama': ('thriller', 'drama'),
    'thriller_mystery': ('thriller', 'mystery'),
    'family_comedy': ('family', 'comedy'),
    'family_adventure': ('family', 'adventure'),
}

for name, parts in hybrid_genres.items():
    # Test each genre separately: `'a' and 'b' in x` would only check 'b'
    df[f'hybrid_{name}'] = df['genres'].apply(lambda x, parts=parts: int(all(p in x.lower() for p in parts)))

### Genre Count (in other words, how many genres does a movie typically have?)

In [None]:
sns.countplot(y=num_df['genre_count'])

In [None]:
sns.boxplot(x=df['story_word_count'])

In [None]:
sns.swarmplot(y = df['story_word_count'], x = df['genre_count'])

In [None]:
sns.swarmplot(y = df['votescore'], x = df['genre_count'])

In [None]:
df[['genres','votescore']].head(5)

# NEXT STEP: Story description

In [None]:
from wordcloud import WordCloud

In [None]:
# Every whitespace-separated token in the file is treated as a stop word
with open("stop_words_english.txt") as t:
    stopwords = t.read().split()

stopwords.append('-')
stopwords.append('story-missing')

In [None]:
story_words = df['story_desc'].str.lower().str.findall(r"[\w.\-']+").sum()

In [None]:
text = ' '.join(str(x) for x in story_words)

In [None]:
img_mask = np.array(Image.open("image_mask.png"))

In [None]:
def getFrequencyDict(words):
    """Count how often each word appears, ignoring case and skipping stop words."""
    dataDict = {}

    for w in words:
        w = w.lower()
        if w in stopwords:
            continue
        dataDict[w] = dataDict.get(w, 0) + 1

    return dataDict

In [None]:
freq_dict = getFrequencyDict(story_words)

In [None]:
w_keys = list(freq_dict.keys())
w_vals = [freq_dict[k] for k in w_keys]
    
data_wordfreq = {'word':w_keys, 'freq':w_vals}
df_wordfreq = pd.DataFrame(data=data_wordfreq, index=None)

In [None]:
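# Styler.hide(axis="index") replaces hide_index(), which newer pandas versions no longer provide
common_words = df_wordfreq[df_wordfreq['freq'] > 10].sort_values("freq", ascending=False).head(15).style.hide(axis="index")
common_words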

In [None]:
df['desc_girl'] = df['story_desc'].apply(lambda x: 1 if 'girl' in x.lower() else 0)
df['desc_boy'] = df['story_desc'].apply(lambda x: 1 if 'boy' in x.lower() else 0)
df['desc_young'] = df['story_desc'].apply(lambda x: 1 if 'young' in x.lower() else 0)
df['desc_family'] = df['story_desc'].apply(lambda x: 1 if 'family' in x.lower() else 0)
df['desc_friend'] = df['story_desc'].apply(lambda x: 1 if 'friend' in x.lower() else 0)
df['desc_named'] = df['story_desc'].apply(lambda x: 1 if 'named' in x.lower() else 0)
df['desc_save'] = df['story_desc'].apply(lambda x: 1 if 'save' in x.lower() else 0)
df['desc_evil'] = df['story_desc'].apply(lambda x: 1 if 'evil' in x.lower() else 0)
df['desc_life'] = df['story_desc'].apply(lambda x: 1 if 'life' in x.lower() else 0)
df['desc_man'] = df['story_desc'].apply(lambda x: 1 if 'man ' in x.lower() else 0)
df['desc_city'] = df['story_desc'].apply(lambda x: 1 if 'city' in x.lower() else 0)

used_common_word_list = ["girl","boy","young","family","friend","named","save","evil","life","man","city"]

# desc_noa ("none of the above") is 1 when the description contains none of the common words
desc_cols = [f'desc_{x}' for x in used_common_word_list]
df['desc_noa'] = (df[desc_cols].sum(axis=1) == 0).astype(int)

df['long_story_desc'] = df['story_word_count'].apply(lambda x: 1 if x >= 30 else 0)

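The keyword flags rely on substring checks, so `'boy'` also matches "boyfriend", and `'man '` matches "woman " while missing "man." at the end of a sentence. If whole-word matching is preferred, a regular expression with word boundaries is one option. This is only a sketch: `contains_word` is a hypothetical helper, and the sentences are made up. Note the trade-off that a whole-word match on `friend` would no longer count "friends".

```python
import re

def contains_word(text, word):
    """True when `word` appears as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

first = "A woman travels far."
second = "The story of an old man."

# Substring check with a trailing space, as used for desc_man
substring_first = 'man ' in first.lower()    # hit caused by "woman "
substring_second = 'man ' in second.lower()  # miss: "man." has no trailing space

# Whole-word check
word_first = contains_word(first, 'man')
word_second = contains_word(second, 'man')
print(substring_first, substring_second, word_first, word_second)
```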
In [None]:
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='black',
                stopwords = stopwords,
                min_font_size = 12,
                margin = 11,
                mask=img_mask,
                contour_width=.5, contour_color='lightgray').generate(text)

In [None]:
import random
def grey_color_func(**kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(75, 100)

In [None]:
plt.figure(figsize = (15, 8), facecolor = "black")
plt.imshow(wordcloud.recolor(color_func=grey_color_func))
plt.axis("off")

# plt.title("Animated Movie Story Description Word Cloud",color='white')

plt.figtext(.35, 1, 'Animated Movie Story Description Word Cloud', ha='center', va='center',color='white',fontsize=17)

idea_quote = '"Get a good idea and stay with it. Dog it, \nand work at it until it’s done right."\n\n- Walt Disney'
plt.figtext(.23, .5, idea_quote, ha='center', va='center',color='white',fontsize=14)

plt.annotate("All descriptions are taken from the story description of each IMDb page",
            xy=(60, 25), xycoords='figure pixels', color='white',fontsize=10)

wordcloud.to_file("graphs/basic_wordcloud.png")
plt.savefig('graphs/story_wordcloud.png', bbox_inches='tight')
plt.tight_layout(pad = 0)


plt.show()

### Production

In [None]:
df['production_companies']

In [None]:
def getFreq(data):
    """Count how many films each comma-separated entry (e.g. a company) appears in."""
    dataDict = {}

    for c in data:
        for x in c.split(', '):
            # Keys are kept exactly as they appear in the split so they match later lookups
            dataDict[x] = dataDict.get(x, 0) + 1

    return dataDict

In [None]:
companies = getFreq(df['production_companies'])
# print(companies)

---
### Score Based on Production Companies

In [None]:
c_keys = list(companies.keys())
c_keys.sort()
c_vals = [companies[k] for k in c_keys]
    
comp_avg_score, comp_avg_mc, comp_b = avg_score_by_keys(df, 'production_companies', c_keys, False)

avgs_s = [comp_avg_score[k] for k in c_keys]
count_mc = [comp_b.get(k,0) for k in c_keys]
avgs_mc = [comp_avg_mc.get(k,-1) for k in c_keys]

# data_genre = {'genre':keys, 'count':vals, 'avg_score':avgs_s, 'avg_metacritic':avgs_mc, 'count_mc':count_mc}
# df_genre = pd.DataFrame(data=data_genre)
    
data_company = {'company':c_keys, 'freq':c_vals, 'avg_score':avgs_s, 'avg_metacritic':avgs_mc, 'count_mc':count_mc}
df_company = pd.DataFrame(data=data_company, index=None)

In [None]:
df_company = df_company.drop_duplicates(subset=['company'], keep="first")
df_company_b = df_company.query('count_mc > 3 & freq > 5')

In [None]:
df_company_b.sort_values("freq", ascending=False).head(10).style

In [None]:
plt.figure(figsize = (15,10))
ax = sns.barplot(data = df_company.sort_values("freq", ascending=False).head(20), x="freq", y="company")

plt.xlabel('Count')
plt.ylabel("Company")
plt.title('Movie Count by Production Company')
# plt.tick_params(axis='x', rotation=90)
plt.show()

In [None]:
plt.figure(figsize = (15,12))
sorted_data_comp = df_company_b.sort_values("avg_score", ascending=False).head(20)

pal = sns.diverging_palette(170, 90, l=85, n=40)

ax_c = sns.barplot(data = sorted_data_comp, y="company", x="avg_score", edgecolor = '.5', palette=pal)
plt.xlabel('Votescore average')
plt.ylabel("Production Company")
plt.title('Top Vote Score Averages by Production Companies')

labels_a = [f"{x} films" for x in sorted_data_comp['freq']]
labels_b = [f" {round(b,1)}" for b in sorted_data_comp['avg_score']]

plt.bar_label(ax_c.containers[-1], labels = labels_a, label_type='center')
plt.bar_label(ax_c.containers[-1], labels = labels_b)

plt.show()

In [None]:
plt.figure(figsize = (15,15))
sorted_data_comp = df_company_b.sort_values("avg_metacritic", ascending=False).head(20)

pal = sns.diverging_palette(140, 100, l=75, s=90, n=40)

ax2 = sns.barplot(data = sorted_data_comp, y="company", x="avg_metacritic", edgecolor = '.5', palette=pal)
plt.ylabel('Company')
plt.xlabel("Average Metacritic Score")
plt.title('Average Metacritic Score by Company')

labels_a = [f"{a} films " for a in sorted_data_comp['count_mc']]
labels_b = [f" {round(b,1)}" for b in sorted_data_comp['avg_metacritic']]

plt.bar_label(ax2.containers[-1], labels = labels_a, label_type='center')
plt.bar_label(ax2.containers[-1], labels = labels_b)

# plt.tick_params(axis='x', rotation=45)
plt.savefig('graphs/metacritic_by_company.png', bbox_inches='tight')
plt.show()

In [None]:
data = df_company[df_company["avg_metacritic"] > -1].sort_values("freq", ascending=False).head(20)
data = data.sort_values("avg_score", ascending=False).head(20)

data[['company','avg_score','avg_metacritic']].plot.bar(
    figsize=(15,10), 
    width=0.7,
    color=['darkgray','seagreen'],
    title='Average IMDb/Metacritic Scores by Top 20 Most Frequent Production Companies',
    secondary_y='avg_metacritic')

ax1, ax2 = plt.gcf().get_axes() 

ax1.set_ylim([0, 9])
ax2.set_ylim([0, 90])

ax1.set_ylabel("IMDb User Score")
ax2.set_ylabel("Metacritic Score")

ax1.legend(["Votescore"])
ax2.legend(["Metacritic"], loc="upper left")

x_plot = list(data['company'])
plt.xticks(np.arange(len(data)),x_plot)

plt.show()

In [None]:
major_comps = ['disney', 'pixar', 'dreamworks', 'columbia', 'dentsu', 'sony', 'universal', 'fox', 'ghibli', 'paramount']

In [None]:
df['company_disney'] = df['production_companies'].apply(lambda x: 1 if 'disney' in x.lower() else 0)
df['company_dreamworks'] = df['production_companies'].apply(lambda x: 1 if 'dreamworks' in x.lower() else 0)
df['company_pixar'] = df['production_companies'].apply(lambda x: 1 if 'pixar' in x.lower() else 0)
df['company_columbia'] = df['production_companies'].apply(lambda x: 1 if 'columbia pictures' in x.lower() else 0)
df['company_dentsu'] = df['production_companies'].apply(lambda x: 1 if 'dentsu' in x.lower() else 0)
df['company_sony'] = df['production_companies'].apply(lambda x: 1 if 'sony' in x.lower() else 0)
df['company_universal'] = df['production_companies'].apply(lambda x: 1 if 'universal' in x.lower() else 0)
df['company_fox'] = df['production_companies'].apply(lambda x: 1 if 'fox' in x.lower() else 0)
df['company_ghibli'] = df['production_companies'].apply(lambda x: 1 if 'studio ghibli' in x.lower() else 0)
df['company_paramount'] = df['production_companies'].apply(lambda x: 1 if 'paramount' in x.lower() else 0)

# company_other is 1 when none of the major studios above are credited
company_cols = [f'company_{x}' for x in major_comps]
df['company_other'] = (df[company_cols].sum(axis=1) == 0).astype(int)


In [None]:
df[df['company_other'] == 0][['production_companies','company_other']].head(10)

In [None]:
df[df['company_other'] == 1][['production_companies','company_other']].head(5)

### Aspect ratio

In [None]:
df['aspect_ratio'] = df['aspect_ratio'].str.replace(" ", "")

In [None]:
sns.barplot(y=df['aspect_ratio'].value_counts().index, x=df['aspect_ratio'].value_counts())
plt.tick_params(axis='x', rotation=0)
plt.show()

# NEXT STEP: Work more on countries

In [None]:
df['production_countries'].unique()

In [None]:
def getFreqCountry(data):
    """Count films per production country, folding every Germany variant into "Germany"."""
    dataDict = {}

    for c in data:
        for x in c.split(', '):
            # e.g. "West Germany" is counted under "Germany"
            key = "Germany" if "germany" in x.lower() else x
            dataDict[key] = dataDict.get(key, 0) + 1

    return dataDict

In [None]:
countries = getFreqCountry(df['production_countries'])

In [None]:
countries

In [None]:
# countries['Germany'] += countries.get("West Germany")
# del countries['West Germany']

In [None]:
country_keys = list(countries.keys())
country_keys.sort()

country_vals = [countries[k] for k in country_keys]
    
country_avg_score, country_avg_mc, count_c = avg_score_by_keys(df, 'production_countries', country_keys)

# Merging various countries:
# country_avg_score['Czechoslovakia'] = round(df[df['production_countries'].str.contains("Czech")]['votescore'].mean(),3)
# country_avg_mc['Czechoslovakia'] = round(df[df['metacritic'] > 0][df['production_countries'].str.contains("Czech")]['metacritic'].mean(),3)
# count_c['Czechoslovakia'] = len(df[df['metacritic'] > 0][df['production_countries'].str.contains("Czech")]['metacritic'])

avgs_s = [country_avg_score[k] for k in country_keys]
count_mc = [count_c.get(k,0) for k in country_keys]
avgs_mc = [country_avg_mc.get(k,-1) for k in country_keys]
    
data_country = {'country':country_keys, 'freq':country_vals, 'avg_score': avgs_s, 'avg_metacritic': avgs_mc, 'count_mc': count_mc}
df_country = pd.DataFrame(data=data_country, index=None)

In [None]:
df_country = df_country.drop_duplicates(subset=['country'], keep="first")

In [None]:
sns.barplot(data=df_country.sort_values("freq", ascending=False).head(15),y='country',x='freq')

In [None]:
plt.figure(figsize = (15,12))
sorted_data_country = df_country.sort_values("avg_score", ascending=False)

pal = sns.diverging_palette(170, 10, l=85, s=120, sep=10, n=55)

ax_ctry = sns.barplot(data = sorted_data_country, y="country", x="avg_score", edgecolor = '.5', palette=pal)
plt.xlabel('Votescore Average')
plt.ylabel("Production Country")
plt.title('Top Vote Score Averages by Production Countries')

labels_a = [f"{x} film(s)" for x in sorted_data_country['freq']]
labels_b = [f" {round(b,1)}" for b in sorted_data_country['avg_score']]

plt.bar_label(ax_ctry.containers[-1], labels = labels_a, label_type='center')
plt.bar_label(ax_ctry.containers[-1], labels = labels_b)

plt.savefig('graphs/score_average_by_country.png', bbox_inches='tight')
plt.show()

In [None]:
plt.figure(figsize = (15,12))
sorted_data_country = df_country.sort_values("avg_metacritic", ascending=False).query('avg_metacritic > 0')

pal = sns.diverging_palette(140, 10, l=85, s=120, sep=10, n=55)

ax_ctry = sns.barplot(data = sorted_data_country, y="country", x="avg_metacritic", edgecolor = '.5', palette=pal)
plt.xlabel('Metacritic average')
plt.ylabel("Production Country")
plt.title('Metacritic Score Averages by Countries')

labels_a = [f"{x} film(s)" for x in sorted_data_country['count_mc']]
labels_b = [f" {round(b,1)}" for b in sorted_data_country['avg_metacritic']]

plt.bar_label(ax_ctry.containers[-1], labels = labels_a, label_type='center')
plt.bar_label(ax_ctry.containers[-1], labels = labels_b)

plt.show()

# Continent

In [None]:
# Source: https://stackoverflow.com/questions/55910004/get-continent-name-from-country-using-pycountry
import pycountry_convert as pc

def country_to_continent(country_name):
    try:
        if "czech" in country_name.lower():
            country_name = "Czechia"
        if "serbia" in country_name.lower():
            country_name = "Serbia"
        if "germany" in country_name.lower():
            country_name = "Germany"
        country_alpha2 = pc.country_name_to_country_alpha2(country_name)
    except KeyError:
        country_alpha2 = pc.country_name_to_country_alpha2(country_name.split()[0])
    country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name

df_country['continent'] = df_country['country'].apply(country_to_continent)

In [None]:
df_country[['country','continent','freq','avg_score']].groupby('continent').avg_score.mean().round(1)

In [None]:
plt.figure(figsize = (8,8))

pal = sns.diverging_palette(200, 350, l=65, s=100, center='light', n=4)


sns.boxplot(data=df_country, y='continent', x='avg_score', palette=pal)
plt.xlabel('Average IMDb Votescore')
plt.ylabel("Continent")
plt.title('Average Animated Movie Scores by Continents')

plt.show()

In [None]:
def make_df(data, counted_col, count_by_one=True, y="freq"):
    counts = {}

    if count_by_one:
        # data is a Series of comma-separated strings: add 1 for every name in every cell
        for cell in data:
            try:
                names = cell.split(', ')
            except AttributeError:
                # Non-string cells (e.g. NaN) are counted as "Uncredited"
                counts["Uncredited"] = counts.get("Uncredited", 0) + 1
                continue
            for name in names:
                name = name.strip()
                if name == "":
                    continue
                # Read and write the same stripped key so counts are not reset
                counts[name] = counts.get(name, 0) + 1
    else:
        # data is a DataFrame: sum column y for each distinct value of counted_col
        for key, val in zip(data[counted_col], data[y]):
            counts[key] = counts.get(key, 0) + val

    df_final = pd.DataFrame({counted_col: list(counts.keys()), y: list(counts.values())})

    return df_final
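
The counting branch of `make_df` can also be written with `collections.Counter` from the standard library. The sketch below is only an illustration under stated assumptions, not the notebook's method: `count_names` and the sample `cells` list are hypothetical, and missing cells are assumed to arrive as non-string values (`None` or `NaN`) that should be counted as "Uncredited", as `make_df` does.

```python
from collections import Counter


def count_names(cells):
    """Count how often each name appears across comma-separated cells."""
    counts = Counter()
    for cell in cells:
        if not isinstance(cell, str):
            # Assumption: a non-string cell (None, NaN) means no credit was listed
            counts["Uncredited"] += 1
            continue
        # Strip whitespace and skip empty fragments so each name maps to one key
        counts.update(name.strip() for name in cell.split(",") if name.strip())
    return counts


# Hypothetical sample values, for illustration only
cells = ["English, French", "English", None, "Japanese, English"]
language_counts = count_names(cells)
print(language_counts["English"])  # 3
```

If this route were taken, a `Counter` could be turned into the same two-column frame with `pd.DataFrame(list(counts.items()), columns=[counted_col, "freq"])`.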

In [None]:
df_cont = make_df(df_country, 'continent', False)

In [None]:
df_cont

In [None]:
sns.barplot(data=df_cont.sort_values("freq",ascending=False), x='continent', y='freq')

In [None]:
def check_continents(data):
    cont_list = set()
    for x in data.split(', '):
#         print(x)
        val = country_to_continent(x)
#         print(val)
        cont_list.add(val) 
    return cont_list

In [None]:
# Resolve each movie's set of continents once, then derive the indicator columns from it
continent_sets = df['production_countries'].apply(check_continents)

df['continent_namerica'] = continent_sets.apply(lambda s: 1 if "North America" in s else 0)
df['continent_europe'] = continent_sets.apply(lambda s: 1 if "Europe" in s else 0)
df['continent_asia'] = continent_sets.apply(lambda s: 1 if "Asia" in s else 0)
df['continent_oceania'] = continent_sets.apply(lambda s: 1 if "Oceania" in s else 0)

In [None]:
df.head(5)

### Language and Country

In [None]:
plt.figure(figsize = (10,8))
sns.violinplot(data=df, x="country_count",
                y="language_count", palette="muted")

In [None]:
sns.countplot(data=df,x="language_count")

In [None]:
df_lang = make_df(df['languages'], 'languages', True)

In [None]:
plt.figure(figsize = (10,4))
sns.barplot(data=df_lang.sort_values("freq",ascending=False).head(10),y='languages',x='freq')

In [None]:
# Show the ten most frequent languages (the stray, uncalled .value_counts was dropped)
df_lang.sort_values('freq',ascending=False).head(10)

In [None]:
df['lang_english'] = df['languages'].apply(lambda x: 1 if "English" in x else 0)
df['lang_japanese'] = df['languages'].apply(lambda x: 1 if "Japanese" in x else 0)
df['lang_french'] = df['languages'].apply(lambda x: 1 if "French" in x else 0)
df['lang_spanish'] = df['languages'].apply(lambda x: 1 if "Spanish" in x else 0)
df['lang_german'] = df['languages'].apply(lambda x: 1 if "German" in x else 0)

# Director / Writers

In [None]:
sns.countplot(data=df.sort_values("director_count",ascending=False), x='director_count')
plt.title('Typical Director Count in Animated Movies')
plt.ylabel('Movie Count')
plt.xlabel('Director Involvement')
plt.show()

In [None]:
sns.countplot(data=df.sort_values("writer_count",ascending=False), x='writer_count')
plt.title('Typical Writer Count in Animated Movies')
plt.ylabel('Movie Count')
plt.xlabel('Writer Involvement')
plt.show()

In [None]:
def a_in_b(df,col_a,col_b,new_col):
    """Set df[new_col] to 1 if any name in col_a also appears in col_b, else 0."""
    def any_shared(row):
        names_a = {a.strip() for a in row[col_a].split(", ")}
        names_b = set(row[col_b].split(", "))
        return 1 if names_a & names_b else 0

    # Assign the whole column at once instead of chained df[new_col][x] = ... writes,
    # which trigger SettingWithCopyWarning and can silently fail to update df
    df[new_col] = df.apply(any_shared, axis=1)


a_in_b(df,'writers','directors','written_by_director')

In [None]:
df_dir = make_df(df['directors'], 'director', True)

In [None]:
df_wri = make_df(df['writers'], 'writer', True)

In [None]:
df_dir

In [None]:
df.columns.get_loc('directors')

In [None]:
# For each row of df_orig, sum the frequency (from df_freq) of every person credited in the target column.

# Example: Ron Clements has a directing credit on 7 movies and John Musker on 6.
# Since both directed Aladdin, that movie gets 7 + 6 = 13 frequency points.

def get_freq_points(target,df_orig,df_freq):
    new_colname = target +'_freq_points'

    # df_freq names its column in the singular ('director', 'writer')
    if target == 'directors':
        col = 'director'
    elif target == 'writers':
        col = 'writer'
    else:
        col = target

    # Look-up table: name -> number of credited movies
    freq_lookup = {str(k).strip(): v for k, v in zip(df_freq[col], df_freq['freq'])}

    # Names missing from df_freq contribute 0 points
    points = [sum(freq_lookup.get(name.strip(), 0) for name in names.split(", "))
              for names in df_orig[target]]

    # Write to df_orig (not the global df) and assign the column in one step
    if new_colname in df_orig.columns:
        print('Column already exists. Overwriting its values.')
        df_orig[new_colname] = points
    else:
        df_orig.insert(df_orig.columns.get_loc(target), new_colname, points)

In [None]:
get_freq_points('directors',df,df_dir)

In [None]:
get_freq_points('writers',df,df_wri)
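
The Clements/Musker example from the comments above can be checked by hand. The following is a minimal, standard-library sketch of the same arithmetic; `freq_points` and `director_freq` are hypothetical names used only for illustration, and treating an unlisted name as 0 points is an assumption.

```python
def freq_points(credited, freq_lookup):
    """Sum the credit counts of every person credited on one movie."""
    return sum(freq_lookup.get(name.strip(), 0) for name in credited.split(", "))


# Credit counts taken from the example in the notebook comments
director_freq = {"Ron Clements": 7, "John Musker": 6}

aladdin_points = freq_points("Ron Clements, John Musker", director_freq)
print(aladdin_points)  # 7 + 6 = 13
```

Because the points are a plain sum, a movie with many credited directors or writers scores higher than a solo effort by an equally prolific person, which is worth keeping in mind when reading this feature.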

In [None]:
plt.figure(figsize = (10,8))
sns.barplot(data=df_dir.sort_values("freq",ascending=False).head(23), x="freq", y="director")
plt.title('Most Frequent Directors in Animated Movies')
plt.savefig('graphs/frequent_directors.png', bbox_inches='tight')
plt.show()

In [None]:
plt.figure(figsize = (10,8))
sns.barplot(data=df_wri.sort_values("freq",ascending=False).head(25), x="freq", y="writer")
plt.title('Most Frequent Writers in Animated Movies')
plt.show()

# Box Office

In [None]:
df[df['box_office'] > 0].worldwide_gross_usd.hist(bins=100)

In [None]:
np.cbrt(df[df['box_office'] > 0].worldwide_gross_usd).hist(bins=30)

In [None]:
np.sqrt(df[df['box_office'] > 0].budget_est_usd).hist(bins=30)

In [None]:
df['orig_bgt_currency'] = df['orig_bgt_currency'].apply(lambda x: 'other' if x not in ['usd','yen','pound'] else x)

In [None]:
sns.boxplot(data=df,x='orig_bgt_currency', y='budget_est_usd')
plt.tick_params(axis='x', rotation=45)
plt.show()

In [None]:
df.orig_bgt_currency.value_counts()

In [None]:
a = df[df['box_office'] > 0].groupby('year').budget_est_usd.median()

In [None]:
b = df[df['box_office'] > 0].groupby('year').profit_usd.median()

In [None]:
c = df[df['box_office'] > 0].groupby('year').worldwide_gross_usd.median()

In [None]:
years = a.index.astype(int)

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

axes[0].plot(years, a.values.astype(float).round(2), lw=2, color="orange", alpha=0.7, label="budget")
axes[1].plot(years, b.values.astype(float).round(2), lw=2, color="green", alpha=.7, label="profit")
axes[1].plot(years, c.values.astype(float).round(2), lw=2, color="blue", alpha=.3, label="worldwide gross")

# a, b and c hold yearly medians, so the titles say "Median" rather than "Average"
axes[0].set_title("Median Box Office Movie Budget Over Time")
axes[1].set_title("Median Box Office Movie Profit/Worldwide Gross Over Time")

axes[0].set_xlabel("Year")
axes[1].set_xlabel("Year")

axes[0].set_ylabel("USD ($10 Million)")
axes[1].set_ylabel("USD ($10 Million)")

axes[1].annotate("As costs go up, profits become more volatile over time",
            xy=(550, 50), xycoords='figure pixels')

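# plt.legend only labels the last axes, so add a legend to each subplot explicitly
axes[0].legend(loc="upper left")
axes[1].legend(loc="upper left")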
plt.savefig('graphs/box_office_costs_profits.png', bbox_inches='tight')
plt.show()

In [None]:
r = df.groupby(df['year']).runtime_minutes.mean().astype(float).round(2)
bpm = df.groupby(df['year']).avg_usd_budget_per_minute.median().astype(float).round(2)

In [None]:
fig, ax = plt.subplots(figsize=(16, 5))
ax2 = ax.twinx()

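ax.plot(r.index, r.values.astype(float).round(2), lw=2, color="orange", alpha=0.7, label="mean runtime")
ax2.plot(bpm.index, bpm.values.astype(float).round(2), lw=2, color="red", alpha=0.4, label="median budget per minute")

ax.set_xlabel("Year")
ax.set_ylabel("Runtime (minutes)")
ax2.set_ylabel("Budget per minute (USD)")
fig.legend(loc="upper left")
plt.show()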

In [None]:
box_office_genre = round(pd.pivot_table(df_validBO, 
                                        index='genres',
                                        values='worldwide_gross_usd', 
                                        aggfunc = 'median'),3)

box_office_genre.sort_values('worldwide_gross_usd', ascending=False)

# Votes

In [None]:
sns.histplot(x=df['votes'])

# Votescore / Metacritic

In [None]:
def swap(l,a,b):
    # Exchange the items at positions a and b in place
    l[a], l[b] = l[b], l[a]

plt.figure(figsize = (15,8))
sns.scatterplot(data=df[df['metacritic'] != -1], y="metacritic" ,x="votescore",
                s=80,hue="rating", palette="muted", style="year_period")

L = plt.legend(ncol=2)

current_handles, current_labels = plt.gca().get_legend_handles_labels()

# Rename the legend entries (seaborn lists the hue levels first, then the style levels)
current_labels[0] = "Rating"
current_labels[-1] = "1960-1985"
current_labels[-2] = "< 1960"
current_labels[-3] = "1985-2000"
current_labels[-4] = "> 2000"
current_labels[-5] = "\nYear Period"

swap(current_handles,-1,-3)
swap(current_handles,-1,-2)
swap(current_handles,-1,-4)

swap(current_labels,-1,-3)
swap(current_labels,-1,-2)
swap(current_labels,-1,-4)

# sort or reorder the labels and handles
plt.legend(current_handles,current_labels,ncol=2,labelspacing = .75)
plt.savefig('graphs/vote_meta2.png', bbox_inches='tight')
plt.show()

In [None]:
def metacritic_bin(x):
    # Each branch is reached only if the previous one failed, so a lower bound is enough
    if x >= 60:
        return 'green'
    elif x >= 40:
        return 'yellow'
    elif x >= 1:
        return 'red'
    else:
        return 'none'
    
df['metacritic_colorcode'] = df['metacritic'].map(metacritic_bin)
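
A quick boundary check makes the bin edges explicit. The sketch below restates the thresholds from `metacritic_bin` so that it runs on its own. The sample scores are hypothetical, and -1 is assumed to be the placeholder for a missing Metacritic score, as the `df['metacritic'] != -1` filter in the scatter plot cell suggests.

```python
def metacritic_bin(score):
    """Same thresholds as the notebook function, repeated so the sketch is self-contained."""
    if score >= 60:
        return 'green'
    elif score >= 40:
        return 'yellow'
    elif score >= 1:
        return 'red'
    return 'none'


# Hypothetical scores chosen to sit on and around each bin edge
samples = [-1, 1, 39, 40, 59, 60, 100]
binned = {score: metacritic_bin(score) for score in samples}

for score, color in binned.items():
    print(score, color)
```

Scores of exactly 40 and 60 fall into the higher bin, and anything below 1 (including the assumed -1 placeholder) is labeled 'none'.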

In [None]:
df['good_overall_score'] = df['avg_rating_score'].apply(lambda x: 1 if x > 60 else 0)

# END

In [None]:
df.columns

In [None]:
df.to_csv("imdb_eda.csv", index = False)