The Year field in our csv file refers to the year when the film was released. The Academy
Awards are given for films released in the calendar year prior to the ceremony. And so in the
discussion that follows, we'll speak of e.g. some 1943 award (ceremony date), but the relevant
records in our csv file and dataframes will have the year 1942 (the release date of the film in question).
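
The convention can be written down as a pair of small helpers. This is only a sketch that assumes the simple one-year offset described above; the function names `ceremony_year` and `release_year` are hypothetical and are not used elsewhere in this notebook.

```python
def ceremony_year(film_year: int) -> int:
    """Ceremony year for a film released in film_year (assumes a one-year offset)."""
    return film_year + 1


def release_year(award_year: int) -> int:
    """Value expected in the Year column for an award handed out in award_year."""
    return award_year - 1


# A "1943 award" is stored under the release year of the film
print(release_year(1943))
```

When reading the cells below, a ceremony year mentioned in the text corresponds to `release_year(...)` of that year in any `Year ==` filter.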

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import pyarrow
nl = '\n'
df = pd.read_csv('./data/oscars.csv', dtype_backend='pyarrow', sep='\t')
print(f'columns={df.columns}')
columns = ['Year', 'CanonicalCategory', 'Film', 'FilmId', 'Name', 'Winner', 'Nominees']
df = df.loc[:, columns].rename({'CanonicalCategory': 'Category'}, axis=1)
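# A non-null value in the raw Winner column marks a winning record
df['Winner'] = df['Winner'].notna()
# The pattern '../' deletes the two characters before a slash plus the slash
# itself, so a split-year label such as '1927/28' would become '1928'
df['Year'] = pd.to_numeric(df['Year'].str.replace('../', '', regex=True))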

In [None]:
remakes_df = (
    df.loc[:, ['FilmId', 'Film']].drop_duplicates()
      .rename({'Film': 'Title'}, axis=1).loc[:, 'Title']
      .value_counts().reset_index(name='count')
)
print(remakes_df.iloc[:5, :].to_string(index=False))

In [None]:
print('\nNominations')
by_film_id = df.groupby('FilmId')
film_id_to_title = (
    by_film_id.nth(0).loc[:, ['FilmId', 'Film']]
              .set_index('FilmId').to_dict()['Film']
)
nominations = df['FilmId'].value_counts().reset_index(name='count')
nominations['Title'] = (
   nominations['FilmId'].apply(lambda x: film_id_to_title[x])  
)
nominations = nominations.set_index('Title', drop=True).drop('FilmId', axis=1)
nominations.head()


In [None]:

winners = (
    df.loc[df['Winner']].groupby('FilmId').size()
      .reset_index(name='count').sort_values('count', ascending=False)
)
winners['Title'] = (
   winners['FilmId'].apply(lambda x: film_id_to_title[x])
)                    
winners = winners.drop('FilmId', axis=1).set_index('Title')
winners

In [None]:
cats = df['Category'].drop_duplicates()
#print(f'cats are\n{nl.join(cats.to_list())}')
actor_cats = cats[cats.str.contains('ACT')]
print(f'actor_cats are\n{nl.join(actor_cats.to_list())}')

In [None]:
actor_nominees = (
    df.query('(Category in @actor_cats)')['Name'].value_counts()
)
actor_nominees.head()

In [None]:
actor_winners = (
    df.query('Winner & (Category in @actor_cats)')['Name'].value_counts()
)
actor_winners.head()

In [None]:
career_nominations_gb_actor = (
    df.query('Category in @actor_cats').loc[:, ['Year', 'Name']]
        .sort_values('Year').groupby('Name')
)    
earliest, latest = (
    career_nominations_gb_actor.nth(X).set_index('Name').rename({'Year': Y}, axis=1)
    for X, Y in zip([0, -1], ['Earliest', 'Latest'])
)                                                
multiply_nominated = pd.concat([earliest, latest], axis=1)
# Column-wise subtraction gives the same result as a row-by-row apply
multiply_nominated['Span'] = (
    multiply_nominated['Latest'] - multiply_nominated['Earliest']
)
print(multiply_nominated.sort_values('Span', ascending=False).head())


The question of how many awards are given each year is tricky. It's not the same as the number of categories: multiple awards are given in each of the scientific/technical categories and in some honorary categories. Let's look at that for the first few years from 2000 onward (2000 through 2005). The category names here can have unwieldy lengths, so we'll turn them into acronyms.

It should be mentioned that these records give the year in which the film was released. The Academy Awards are always for films released in the previous calendar year.

In [None]:
cols = ['Year', 'Category', 'Winner', 'Film', 'Name']
count_by_yrcat = (
    df.query('Winner').loc[:, cols]
      .groupby(['Year', 'Category']).size().reset_index(name='count')

)
def shorten_categories(series):
    substitutions = {
        'SCIENTIFIC (?:AND|OR) TECHNICAL AWARD': 'SATA',
        'Scientific and Engineering Award': 'SEA',
        'Technical Achievement Award': 'TAA', 'Academy Award of Merit': 'AAM',
        'SPECIAL ACHIEVEMENT AWARD': 'SAA',
        'JOHN A. BONNER MEDAL OF COMMENDATION': 'BONNER',
        'JEAN HERSHOLT HUMANITARIAN AWARD': 'HERSHOLT',
        'DOCUMENTARY': 'DOC', 'COMMENDATION': 'CMND',
        'SHORT FILM': 'SHORT',
        'IRVING G. THALBERG MEMORIAL AWARD': 'THALBERG',
        'GORDON E. SAWYER AWARD': 'SAWYER',
        'Original': 'Orgl',
        'Song Score': 'SS', 'Adaptation Score': 'AS',
        'Black-and-White': 'BW', 'IN A LEADING ROLE': '(Lead)',
        'IN A SUPPORTING ROLE': '(Supp.)',
        ' Picture': '', 'Live Action': 'LA'
    }
    # Apply the substitutions one after another, in dictionary order
    for source, target in substitutions.items():
        series = series.str.replace(source, target, regex=True)
    return series

count_by_yrcat['Category'] = shorten_categories(count_by_yrcat['Category'])
print(
    'To make sense out of the acronymized categories below, please refer to the\n'
    '"substitutions" dictionary in the code cell above:\n'
)
print(
    count_by_yrcat.query(
        '(2000 <= Year <= 2005) and (count > 1)', engine='python'
    )
    .loc[:, ['Year', 'Category', 'count']].to_string(index=False)
)

Only the Awards of Merit get Oscar statuettes; the Scientific and Engineering Awards get bronze tablets; and the Technical Achievement Awards get certificates.

Let's exclude scientific and honorary categories. Are there multiple awards in any of the remaining categories in any year?

In [None]:
scihon_cats = 'SATA|MEDAL|HERSHOLT|BONNER|CMND|SPECIAL|HONORARY|THALBERG'
print(
    count_by_yrcat.query(
        '(count > 1) & ~(Category.str.contains(@scihon_cats))', engine='python')
         .to_string(index=False)
)

Let's look at awards for films made in 1928 in these non-technical and non-honorary categories.

In [None]:
cols1 = ['Year', 'Category', 'Film', 'Name']
         
multiple_nonsci_gb_yrcat = (
    df.query('Winner & ~(Category.str.contains(@scihon_cats))', engine='python')
       .loc[:, cols1].groupby(['Year', 'Category'])
)

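# Groups come out in sorted (Year, Category) order; stopping at the first
# other year relies on 1928 being the earliest Year in the data
for name, group in multiple_nonsci_gb_yrcat:
    if name[0] != 1928:
        break
    if group.shape[0] > 1:
        print(name)
        print(group.loc[:, ['Film', 'Name']]
              .to_string(index=False))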

In the first few years of the Academy Awards, it sometimes happened that an actor or director worked on
more than one film in a year. These artists were nominated, and won, not for their
work in a particular film, as is the case today, but rather for their work in all the films they worked on
during the relevant year. The system changed to the way it is today starting with the 4th Academy Awards. See [the Wikipedia article on the Academy Awards](https://en.wikipedia.org/wiki/Academy_Awards#Milestones).

So what about those two Actress awards in 1969?

In [None]:
for name, group in multiple_nonsci_gb_yrcat:
    if name == (1968, 'ACTRESS'):
        print(group.to_string())

It turns out that Streisand and Hepburn each got exactly 3,030 votes and so two Oscars had to be awarded. [There have been six ties in the history of the Academy Awards](https://www.thewrap.com/6-times-oscars-tied-photos/). Let's see if we can find these occasions in our database:

In [None]:
cols1 = ['Year', 'Category', 'Name']
yrcatname = (
    df.query('Winner & ~(Category.str.contains(@scihon_cats))', engine='python')
       .loc[:, cols1].drop_duplicates().groupby(['Year', 'Category']).size()
       .reset_index(name='size').query('size > 1')
)
yrcatname


Apart from the six ties documented by TheWrap, there are two further anomalies of multiple awards in a single category: seven Assistant Director awards in 1934 and four awards for feature-length documentary in 1943. Let's first look at the AD Awards in general.

In [None]:
print(
    df.query('Winner & (Category == "ASSISTANT DIRECTOR")')
      .loc[:, ['Year', 'Name', 'Film']].to_string(index=False)
)

The AD category ended in 1938. In its inaugural year of 1934, seven different ADs received
the award, none of them for any film in particular.

In [None]:
print(df.query('Winner & (Year == 1933) & (Category == "ASSISTANT DIRECTOR")',
               engine='python').drop(columns=['FilmId', 'Film', 'Winner']))

What about the four feature-length documentary winners in 1943?

In [None]:
print(df.query('Winner & (Year == 1942) & (Category == "DOCUMENTARY (Feature)")',
               engine='python').drop(columns=['FilmId', 'Winner', 'Category'])
               .to_string(index=False)
)

The data here appear to be erroneous: according to other sources, there was only one 1943 award for feature-length documentary, and in fact it went to The Battle of Midway. See e.g. [the Academy's own website for the 1943 awards](https://www.oscars.org/oscars/ceremonies/1943).

Let's assume that every kind of technical award counts as an award, not just the Awards of Merit. In the technical and honorary categories there is only one record per winner (as opposed to the first Actor and Actress awards, which had two and three records to reflect all the films the winning actor/actress had appeared in during the previous year).

And so, to count the number of technical/honorary awards for each year, we just count the number of records with Winner == True in the technical/honorary categories for that year. To count the number of awards in the other categories, we count the number of distinct (Category, Name) pairs for that year, where Category belongs to the non-technical/honorary categories.

Finally, we should account for the apparent error in the data for the 1943 awards for feature-length Documentary by subtracting 3 from the number of awards for that year.
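
The counting rule can be illustrated on a handful of made-up records. This is a minimal sketch: the frame `toy`, its names and films, and the use of `'SATA'` as the only technical marker are all invented for illustration and do not come from the csv file.

```python
import pandas as pd

# Invented winner records: one actor credited for two films in the same year,
# two technical awards, and a two-way tie
toy = pd.DataFrame({
    'Year': [1928, 1928, 1928, 1928, 1968, 1968],
    'Category': ['ACTOR', 'ACTOR', 'SATA', 'SATA', 'ACTRESS', 'ACTRESS'],
    'Name': ['Actor A', 'Actor A', 'Engineer B', 'Engineer C',
             'Actress D', 'Actress E'],
    'Film': ['Film 1', 'Film 2', None, None, 'Film 3', 'Film 4'],
})

is_scihon = toy['Category'].str.contains('SATA')

# Technical/honorary categories: every winning record is one award
scihon_count = toy[is_scihon].groupby('Year').size()

# Other categories: one award per distinct (Year, Category, Name)
other_count = (
    toy[~is_scihon].drop_duplicates(['Year', 'Category', 'Name'])
       .groupby('Year').size()
)
print(scihon_count)
print(other_count)
```

In this toy frame the actor with two films counts once, while the two tied actresses count as two awards, which is the behavior the rule is meant to produce.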

In [None]:
scihon_winners_by_year = (
    count_by_yrcat.loc[:, ['Year', 'Category', 'count']]
            .query('Category.str.contains(@scihon_cats)', engine='python')
)

print(scihon_winners_by_year.head().to_string(index=False))
print(scihon_winners_by_year.query(
        'Year == 1998', engine='python'
      ).to_string(index=False)
)

Wait a second! Were there really 38 scientific and honorary awards in 1999?

In [None]:
scihon_details = df.loc[df['Winner'], ['Year', 'Category', 'Name', 'Nominees']]
scihon_details['Nominees'] = scihon_details['Nominees'].map(
    lambda x: f'{x[:30]}...' if  pd.notna(x) and len(x) > 30 else x)
scihon_details['Category'] = shorten_categories(scihon_details['Category'])
scihon_details = scihon_details.query(
    'Category.str.contains(@scihon_cats)', engine='python')
print(scihon_details.query('Year == 1998', engine='python')
      .to_string(index=False)
)

According to [the Academy website for the 1999 scientific and technical awards](https://www.oscars.org/sci-tech/ceremonies/1999), all the SATA nominees listed here actually got awards.
And Gray, Kazan, and Jewison did indeed get the honorary awards shown above.
OK, so now let's get the yearly tally of scientific and honorary awards.

In [None]:
scitech_hon_awards_df = (
    scihon_winners_by_year.drop('Category', axis=1).groupby('Year').sum()
    .rename(columns={'count': 'scitech_hon_awards'})
)
print(scitech_hon_awards_df.to_string())

In [None]:
non_scitech_hon_awards_df = df.query('Winner').loc[:, cols]
non_scitech_hon_awards_df['Category'] = shorten_categories(
    non_scitech_hon_awards_df['Category']
)
non_scitech_hon_awards_df = (
    non_scitech_hon_awards_df.query(
        '~Category.str.contains(@scihon_cats)', engine='python'
    )
    .drop(columns=['Winner', 'Film']).drop_duplicates().loc[:, ['Year', 'Name']]
    .groupby('Year').count().rename(columns={'Name': 'non_scitech_hon_awards'})
)
# Correct for erroneous data for the 1943 awards for feature-length documentary
non_scitech_hon_awards_df.at[1942, 'non_scitech_hon_awards'] -= 3
print(non_scitech_hon_awards_df.to_string())

We concatenate the series for scientific/honorary awards with that for all other awards, and sum the two. Finally, we adjust the year forward by one to reflect the year of the award ceremony rather than the year the awarded work was done. We plot the total.
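
The concatenate, sum, and shift steps can be tried out on toy data first. This is a sketch with invented numbers; the frames `sci`, `other`, and `toy_awards` are hypothetical stand-ins, and only the column names are borrowed from this notebook.

```python
import pandas as pd

# Invented yearly counts; note that `sci` has no row for 1929
sci = pd.DataFrame({'scitech_hon_awards': [2, 5]}, index=[1930, 1931])
other = pd.DataFrame(
    {'non_scitech_hon_awards': [10, 12, 11]}, index=[1929, 1930, 1931]
)

# Column-wise concat aligns on the index; a year missing from one frame
# becomes NaN, which is filled with 0 before converting back to integers
toy_awards = (
    pd.concat([sci, other], axis=1).fillna(0).astype(int).sort_index()
)

# Shift from release year to ceremony year, then add the two columns
toy_awards.index = toy_awards.index + 1
toy_awards['total'] = toy_awards.sum(axis=1)
print(toy_awards)
```

The `fillna(0)` matters because a year that appears in only one of the two frames would otherwise produce a missing total.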

In [None]:
awards_df = pd.concat(
    [scitech_hon_awards_df, non_scitech_hon_awards_df], axis=1
).fillna(0).astype(int).sort_index(axis='index')
awards_df.index = awards_df.index.map(lambda x: x + 1)
awards_df['total'] = (
    awards_df['scitech_hon_awards'] + awards_df['non_scitech_hon_awards']
)
awards_df.loc[:, 'total'].plot()