This is a beginner-friendly notebook. We spend a lot of time performing analysis, building complicated models, and tuning parameters for neural networks. But often, many of the questions we want to answer can be tackled with simple queries in SQL or pandas, without such complicated models. In this notebook, we use only pandas to do a quick analysis and answer several first-level questions to get a big picture of the Golden Globe Awards.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/golden-globe-awards/golden_globe_awards.csv


In [2]:
df = pd.read_csv("/kaggle/input/golden-globe-awards/golden_globe_awards.csv")

## Look at data / General Description

In [3]:
df.head()

,year_film,year_award,ceremony,category,nominee,film,win
0,1943,1944,1,Best Performance by an Actress in a Supporting...,Katina Paxinou,For Whom The Bell Tolls,True
1,1943,1944,1,Best Performance by an Actor in a Supporting R...,Akim Tamiroff,For Whom The Bell Tolls,True
2,1943,1944,1,Best Director - Motion Picture,Henry King,The Song Of Bernadette,True
3,1943,1944,1,Picture,The Song Of Bernadette,,True
4,1943,1944,1,Actress In A Leading Role,Jennifer Jones,The Song Of Bernadette,True


In [4]:
# Data Types for each column

df.dtypes

year_film      int64
year_award     int64
ceremony       int64
category      object
nominee       object
film          object
win             bool
dtype: object

In [5]:
# "Film" feature has some missing values
df.isnull().any()

year_film     False
year_award    False
ceremony      False
category      False
nominee       False
film           True
win           False
dtype: bool

In [6]:
# Fill missing film names with "Unknown"
df['film'] = df['film'].fillna('Unknown')

In [7]:
df.isnull().any()

year_film     False
year_award    False
ceremony      False
category      False
nominee       False
film          False
win           False
dtype: bool
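
`isnull().any()` only says whether a column has gaps, not how many. A minimal sketch of counting them, using a small made-up frame (`sample` and its values are illustrative stand-ins, not rows from the dataset):

```python
import pandas as pd

# Toy stand-in for the Golden Globes data (values are illustrative only)
sample = pd.DataFrame({
    "nominee": ["A", "B", "C", "D"],
    "film": ["Film One", None, "Film Two", None],
    "win": [True, False, True, True],
})

# Number of missing values per column
missing_counts = sample.isnull().sum()
print(missing_counts)

# Fill the gaps by assigning the result back to the column
sample["film"] = sample["film"].fillna("Unknown")
remaining_missing = int(sample["film"].isnull().sum())
print(remaining_missing)
```

Assigning the filled column back avoids relying on in-place modification of a selected column, which recent pandas versions warn about.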

## How many awards were given out each year?

In [8]:
win_num_by_year = df[df.win==True].groupby('year_award').win.count().to_frame()
win_num_by_year

year_award,win
1944,6
1945,6
1946,7
1947,7
1948,12
...,...
2016,25
2017,25
2018,25
2019,25


In the first four years (1944 to 1947), fewer than 10 awards were given out per year. From 1948 on, the number grew substantially.

In [9]:
win_num_by_year.query('win >= 12 & win < 25')

year_award,win
1948,12
1949,13
1950,14
1951,16
1952,18
1953,19
1954,23
1965,23
1966,21
1967,21


In [10]:
win_num_by_year.query('win == 25')

year_award,win
1955,25
1956,25
1970,25
1983,25
1984,25
1987,25
1991,25
2007,25
2008,25
2009,25


Starting in the 1990s, 24 awards were given out more consistently, and from 2007 on, 25 awards were given out consistently.
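
Because `win` is boolean, the yearly tally can also be sketched by summing the column directly, which makes it easy to show wins next to nominations. The frame below is a made-up stand-in (`sample` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in: one row per nomination
sample = pd.DataFrame({
    "year_award": [1944, 1944, 1944, 1945, 1945],
    "win": [True, False, True, True, False],
})

by_year = sample.groupby("year_award")["win"]

# Summing booleans counts the True values, i.e. the awards given out
wins_per_year = by_year.sum()

# size() counts every row, i.e. all nominations that year
nominations_per_year = by_year.size()

print(pd.DataFrame({"wins": wins_per_year, "nominations": nominations_per_year}))
```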

## Who are the top 3 actors/actresses with the most Golden Globe wins?

In [11]:
df[df.win==True].groupby('nominee').count().sort_values('win', ascending=False).head(3)

nominee,year_film,year_award,ceremony,category,film,win
Meryl Streep,8,8,8,8,8,8
Jane Fonda,7,7,7,7,7,7
Barbra Streisand,7,7,7,7,7,7


They are Meryl Streep, Jane Fonda and Barbra Streisand!
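
The same ranking can be sketched more compactly with `value_counts()`, which returns a single count per nominee instead of repeating it in every column. The names and values below are made up for illustration. Note that in the real data the `nominee` column can also hold a film title (as in the "Picture" row shown earlier), so a stricter version would first restrict to acting categories.

```python
import pandas as pd

# Toy stand-in: one row per nomination (names are hypothetical)
sample = pd.DataFrame({
    "nominee": ["Ann", "Ann", "Ann", "Ann", "Bob", "Bob", "Cy", "Cy"],
    "win": [True, True, True, False, True, True, True, False],
})

# Keep winning rows only, then count how often each nominee appears
top_winners = sample.loc[sample["win"], "nominee"].value_counts().head(3)
print(top_winners)
```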

## Which categories have the highest probability of winning a Golden Globe once you get nominated?

In [12]:
df.groupby('category').win.apply(lambda x: sum(x==True)*100/x.count()).to_frame().sort_values('win',ascending=False).head(20)

category,win
Henrietta Award (World Film Favorite),100.0
Hollywood Citizenship Award,100.0
International News Coverage,100.0
Television Producer/Director,100.0
Television Achievement,100.0
New Foreign Star Of The Year - Actor,100.0
New Foreign Star Of The Year - Actress,100.0
Henrietta Award (World Film Favorites),92.1875
Actor In A Leading Role,87.5
Actress In A Leading Role,87.5


In some categories, every recorded nominee won (e.g. Hollywood Citizenship Award, New Foreign Star Of The Year - Actor). Categories such as Actor / Actress In A Leading Role, Picture and Cinematography also have a fairly high probability of winning once you get nominated (> 70%).
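
A win rate is easier to interpret next to the number of nominations behind it, since a 100% rate over a handful of rows says less than the same rate over hundreds. A minimal sketch on a made-up frame (category names and values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in: one row per nomination
sample = pd.DataFrame({
    "category": ["Honorary", "Honorary",
                 "Best Actor", "Best Actor", "Best Actor", "Best Actor"],
    "win": [True, True, True, False, False, False],
})

summary = (
    sample.groupby("category")["win"]
    # The mean of a boolean column is the share of True values
    .agg(win_rate="mean", nominations="count")
    .assign(win_rate=lambda d: d["win_rate"] * 100)
    .sort_values("win_rate", ascending=False)
)
print(summary)
```

Sorting or filtering on `nominations` as well would help separate categories that only ever list winners from categories that are genuinely competitive.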

## Which film earned the most awards?

In [13]:
df[df.win==True].groupby('film').win.count().to_frame().sort_values('win').tail(10)

film,win
One Flew Over The Cuckoo's Nest,5
Midnight Express,5
Sex and The City,5
30 Rock,5
Lawrence Of Arabia,5
La La Land,6
"Carol Burnett Show, The",7
Alice,7
M*A*S*H (TV Show),7
Unknown,425


This might be misleading because many film names were missing and I filled them all in as "Unknown". Nevertheless, based on the data we have, M*A*S*H, Alice and The Carol Burnett Show received the most Golden Globe awards (7 each), followed by La La Land (6) and a group of titles with 5 wins each, including Lawrence of Arabia.
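
One way to keep the placeholder out of the ranking is to drop the "Unknown" rows before grouping. A minimal sketch on a made-up frame (titles and values are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in: "Unknown" marks rows whose film name was missing
sample = pd.DataFrame({
    "film": ["Unknown", "Unknown", "Unknown",
             "Show A", "Show A", "Film B", "Film B"],
    "win": [True, True, True, True, True, True, False],
})

# Keep winning rows that have a real film name
winners = sample[sample["win"] & (sample["film"] != "Unknown")]

wins_by_film = (
    winners.groupby("film")["win"].count().sort_values(ascending=False)
)
print(wins_by_film)
```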

## Who won the most Supporting Role in any Motion Picture awards?

The importance of supporting roles is often overlooked.

In [14]:
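# Note: no win==True filter is applied here, so count() tallies
# nominations (wins and losses) per nominee in this category
df[df.category=='Best Performance by an Actress in a Supporting Role in any Motion Picture'].groupby('nominee').\
win.count().to_frame().sort_values('win').tail(10)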

nominee,win
Diane Ladd,3
Dianne Wiest,3
Nicole Kidman,3
Octavia Spencer,3
Amy Adams,4
Shelley Winters,4
Kate Winslet,4
Meryl Streep,5
Maureen Stapleton,5
Lee Grant,5


In [15]:
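# Note: no win==True filter is applied here, so count() tallies
# nominations (wins and losses) per nominee in this category
df[df.category=='Best Performance by an Actor in a Supporting Role in any Motion Picture'].groupby('nominee').\
win.count().to_frame().sort_values('win').tail(10)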

nominee,win
Jason Robards Jr.,3
Al Pacino,3
Brad Pitt,3
Philip Seymour Hoffman,3
Christopher Plummer,3
Red Buttons,3
Robert Duvall,3
Joe Pesci,3
Ed Harris,4
Jack Nicholson,5
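
The two cells above call `count()` on `win` without first filtering on wins, so the numbers are nominations per nominee in the category rather than awards won. A minimal sketch of getting both figures side by side, on a made-up frame (category and nominee names are hypothetical):

```python
import pandas as pd

# Toy stand-in: one row per nomination
sample = pd.DataFrame({
    "category": ["Supporting Actor"] * 5 + ["Other"],
    "nominee": ["X", "X", "X", "Y", "Y", "X"],
    "win": [True, False, False, True, True, True],
})

in_category = sample[sample["category"] == "Supporting Actor"]

per_nominee = (
    in_category.groupby("nominee")["win"]
    # count() = rows (nominations); sum() = True values (wins)
    .agg(nominations="count", wins="sum")
    .sort_values("wins", ascending=False)
)
print(per_nominee)
```

Sorting on `wins` answers the question in the heading, while `nominations` reproduces the kind of count shown in the tables above.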


## Is there any correlation between the length of a film's title and its probability of winning awards?

Just out of curiosity (but expecting the correlation to be very weak)

In [17]:
# Getting word count of film title
df['film_word_count'] = df.film.str.split(" ").str.len()

# Convert True / False to 1 / 0
df['win'] = df['win'].astype(int)

In [22]:
df[['win','film_word_count']].corr().iloc[0,1]

0.009864149948747625

The correlation, just as we expected, is very weak.
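
Since missing titles were filled with "Unknown", each of those rows counts as a one-word title in the calculation above. A minimal sketch of a variant that leaves missing titles as NaN so they drop out of the correlation, on a made-up frame (titles and values are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in: None marks a missing film title
sample = pd.DataFrame({
    "film": ["One", "Two Words", None, "Three Word Title", "A B"],
    "win": [True, False, True, False, True],
})

# Word count per title; stays NaN where the title is missing
word_count = sample["film"].str.split().str.len()

# Booleans as 0/1 so they can be correlated with the counts
win_flag = sample["win"].astype(int)

# Series.corr uses Pearson correlation and skips pairs containing NaN
corr = win_flag.corr(word_count)
print(corr)
```

With real data, the same idea would apply by masking the "Unknown" rows before computing the word counts.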