The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.

![](Nobel_Prize.png)

The Nobel Foundation has made available a dataset of all prize winners, from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is available in the `nobel.csv` file in the `data` folder.

In this project, you'll get a chance to explore and answer several questions about this prizewinning data. We also encourage you to explore any further questions that interest you!

In [1]:
# Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np

# read the data file, check columns and inspect 1st few records
df_nobel = pd.read_csv('./data/nobel.csv')
print(df_nobel.columns)
df_nobel.head()

Index(['year', 'category', 'prize', 'motivation', 'prize_share', 'laureate_id',
       'laureate_type', 'full_name', 'birth_date', 'birth_city',
       'birth_country', 'sex', 'organization_name', 'organization_city',
       'organization_country', 'death_date', 'death_city', 'death_country'],
      dtype='object')


Unnamed: 0,year,category,prize,motivation,prize_share,laureate_id,laureate_type,full_name,birth_date,birth_city,birth_country,sex,organization_name,organization_city,organization_country,death_date,death_city,death_country
0,1901,Chemistry,The Nobel Prize in Chemistry 1901,"""in recognition of the extraordinary services ...",1/1,160,Individual,Jacobus Henricus van 't Hoff,1852-08-30,Rotterdam,Netherlands,Male,Berlin University,Berlin,Germany,1911-03-01,Berlin,Germany
1,1901,Literature,The Nobel Prize in Literature 1901,"""in special recognition of his poetic composit...",1/1,569,Individual,Sully Prudhomme,1839-03-16,Paris,France,Male,,,,1907-09-07,Châtenay,France
2,1901,Medicine,The Nobel Prize in Physiology or Medicine 1901,"""for his work on serum therapy, especially its...",1/1,293,Individual,Emil Adolf von Behring,1854-03-15,Hansdorf (Lawice),Prussia (Poland),Male,Marburg University,Marburg,Germany,1917-03-31,Marburg,Germany
3,1901,Peace,The Nobel Peace Prize 1901,,1/2,462,Individual,Jean Henry Dunant,1828-05-08,Geneva,Switzerland,Male,,,,1910-10-30,Heiden,Switzerland
4,1901,Peace,The Nobel Peace Prize 1901,,1/2,463,Individual,Frédéric Passy,1822-05-20,Paris,France,Male,,,,1912-06-12,Paris,France


In [2]:
# What is the most commonly awarded gender?
df_gender_country = df_nobel.loc[:, ["birth_country", "sex"]]
# df_gender_country.head()
gender_count_m = df_gender_country[df_gender_country["sex"] == "Male"]["sex"].count()
gender_count_w = df_gender_country[df_gender_country["sex"] == "Female"]["sex"].count()
gender_counts = [(gender_count_m, "Male"), (gender_count_w, "Female")]
gender_counts.sort(key=lambda x:x[0], reverse=True)  # sort tuples inplace on the counts of each gender
print(gender_counts)
top_gender = gender_counts[0][1]  # top gender is 1st tuple in list
top_gender

[(905, 'Male'), (65, 'Female')]


'Male'

In [3]:
# What is the most commonly awarded birth country?
df_country = df_nobel.loc[:, ["birth_country"]].sort_values(by="birth_country")
df_country["value"] = 1
df_country_grouped = df_country.groupby("birth_country")["value"].sum()
# print(type(df_country_grouped))  # <class 'pandas.core.series.Series'>
country_grouped_sorted = df_country_grouped.sort_values(ascending=False)
top_country = country_grouped_sorted.index[0]
top_country

'United States of America'
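
Both questions above ask for the most frequent value in a column, so they can also be answered with `Series.value_counts()`. The sketch below is an alternative, not the approach used in the cells above. It runs on a small invented table (`toy`), and the helper name `most_common` is hypothetical.

```python
import pandas as pd

# Toy stand-in for df_nobel; these rows are invented for illustration only.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", None, "Male"],
    "birth_country": ["France", "Poland", "France", None, "Germany"],
})

def most_common(series: pd.Series):
    """Return the most frequent non-missing value in a Series (hypothetical helper)."""
    # value_counts() skips missing values by default and sorts by count, descending.
    return series.value_counts().index[0]

top_gender_toy = most_common(toy["sex"])
top_country_toy = most_common(toy["birth_country"])
print(top_gender_toy, top_country_toy)
```

On the real data, the same helper would be called as `most_common(df_nobel["sex"])` and `most_common(df_nobel["birth_country"])`.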

## Which decade had the highest ratio of US-born Nobel Prize winners to total winners in all categories?

+ Create count of USA winners over each decade
+ Create count of all winners over each decade
+ Join the USA and all winner dataframes together
+ Compute the ratio column in the combined dataframe
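
The four steps above can also be collapsed into one `groupby`, because the mean of a boolean column is the share of `True` values. The sketch below shows this on invented rows (`toy`). It derives the decade with integer arithmetic rather than string slicing, which is a swapped-in technique and not what the following cells do.

```python
import pandas as pd

# Toy stand-in for df_nobel; these rows are invented for illustration only.
toy = pd.DataFrame({
    "year": [1901, 1905, 1912, 1915, 1918, 1923],
    "birth_country": [
        "France", "United States of America", "Germany",
        "United States of America", "United States of America", "Sweden",
    ],
})

# Integer floor division maps 1901..1909 -> 1900, 1910..1919 -> 1910, and so on.
toy["decade"] = (toy["year"] // 10) * 10

# Boolean flag per row; rows with a missing birth_country compare as False.
toy["usa_born"] = toy["birth_country"] == "United States of America"

# Mean of the flag per decade = US-born winners / all winners in that decade.
usa_ratio = toy.groupby("decade")["usa_born"].mean()
best_decade = int(usa_ratio.idxmax())
print(usa_ratio)
print(best_decade)
```

Unlike a join of two separate count tables, this keeps decades with no US-born winners in the result, with a ratio of 0.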

In [4]:
# Create count of USA winners over each decade
df_over_time = df_nobel.loc[:, ["year", "birth_country"]].sort_values(by=["year", "birth_country"])
# create decade column
df_over_time['decade'] = df_over_time["year"].astype(str).str[:3] + "0"
df_over_time['value'] = 1
df_usa = df_over_time.loc[df_over_time['birth_country'] == "United States of America", :]
usa_decades = df_usa.groupby("decade")["value"].sum()
usa_decades.sort_index(inplace=True)
print(usa_decades.shape)
usa_decades.head()

(13,)


decade
1900     1
1910     3
1920     4
1930    14
1940    13
Name: value, dtype: int64

In [5]:
# Create count of all winners over each decade
df_over_time = df_nobel.loc[:, ["year", "birth_country"]].sort_values(by=["year", "birth_country"])
# create decade column
df_over_time['decade'] = df_over_time["year"].astype(str).str[:3] + "0"
df_over_time['value'] = 1
all_win = df_over_time.groupby("decade")["value"].sum()
all_win.sort_index(inplace=True)
print(all_win.shape)
all_win.head()

(13,)


decade
1900    57
1910    40
1920    54
1930    56
1940    43
Name: value, dtype: int64

In [6]:
# Join the USA and all winner dataframes together
df_usa_all = pd.merge(all_win, usa_decades, left_index=True, right_index=True, suffixes=("_all", "_usa"))
# Compute the ratio column in the combined dataframe
df_usa_all["usa_ratio"] = df_usa_all["value_usa"] / df_usa_all["value_all"]
df_usa_all.sort_values("usa_ratio", ascending=False, inplace=True)
max_decade_usa = int(df_usa_all.index[0])
print(max_decade_usa)
# df_usa_all

2000
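
`pd.merge` performs an inner join by default, so a decade that appears in only one of the two count series would be dropped from the result. That does not appear to matter here, since both series have 13 entries. If it did, one option is to reindex the smaller series before dividing. This is a minimal sketch on invented counts; `all_counts` and `usa_counts` are hypothetical names.

```python
import pandas as pd

# Invented counts for illustration; the 1920s have no US-born winners.
all_counts = pd.Series({"1900": 4, "1910": 5, "1920": 2}, name="value")
usa_counts = pd.Series({"1900": 1, "1910": 3}, name="value")

# Align the US counts to every decade, filling missing decades with 0.
usa_aligned = usa_counts.reindex(all_counts.index, fill_value=0)

# Series division is element-wise on matching index labels.
ratio = usa_aligned / all_counts
print(ratio)
print(ratio.idxmax())
```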


## Which decade and Nobel Prize category combination had the highest proportion of female laureates?

+ Create dataframe with count of female winners in each category for each decade
+ Create dataframe with count of all winners in each category for each decade
+ Join the female and all winner dataframes together
+ Compute the ratio column in the combined dataframe
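
As a cross-check on the steps above, the proportion can also be computed in one pass by grouping on decade and category and averaging a boolean flag. The sketch below uses invented rows (`toy`) and is an alternative to the join-based cells that follow, not a replacement for them.

```python
import pandas as pd

# Toy stand-in for df_nobel; these rows are invented for illustration only.
toy = pd.DataFrame({
    "year": [1903, 1903, 1906, 1909, 1909, 1911, 1913, 1915],
    "category": ["Physics", "Physics", "Physics", "Literature", "Literature",
                 "Chemistry", "Chemistry", "Chemistry"],
    "sex": ["Female", "Male", "Male", "Female", "Male",
            "Female", "Female", "Male"],
})

flagged = toy.assign(
    decade=(toy["year"] // 10) * 10,      # 1903 -> 1900, 1911 -> 1910
    is_female=toy["sex"].eq("Female"),    # missing sex compares as False
)

# Mean of the flag per (decade, category) = female laureates / all laureates.
prop_female = flagged.groupby(["decade", "category"])["is_female"].mean()

# idxmax() on a MultiIndex Series returns a (decade, category) tuple.
best = prop_female.idxmax()
max_female_toy = {int(best[0]): best[1]}
print(max_female_toy)
```

Rows with a missing `sex` value, such as organizations, still count toward the denominator, which is the same behaviour as counting every row per decade and category.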

In [7]:
# Create dataframe with count of female winners in each category for each decade
df_sex_cat = df_nobel.loc[:, ["year", "sex", "category"]]
# create decade column
df_sex_cat['decade'] = df_sex_cat["year"].astype(str).str[:3] + "0"
df_sex_cat['value'] = 1
# filter for women
df_fem_cat = df_sex_cat.loc[df_sex_cat['sex'] == "Female", ["decade", "category", "value"]]
df_fem_cat.head(10)


Unnamed: 0,decade,category,value
19,1900,Physics,1
29,1900,Peace,1
51,1900,Literature,1
62,1910,Chemistry,1
128,1920,Literature,1
141,1920,Literature,1
160,1930,Peace,1
179,1930,Chemistry,1
198,1930,Literature,1
218,1940,Literature,1


In [8]:
# aggregate to get counts per decade and category; note that groupby returns a Series with a MultiIndex
fem_cat_decades = df_fem_cat.groupby(["decade", "category"])["value"].sum()
dec_cat = fem_cat_decades.index.to_list()
dec_cat[0:10]

[('1900', 'Literature'),
 ('1900', 'Peace'),
 ('1900', 'Physics'),
 ('1910', 'Chemistry'),
 ('1920', 'Literature'),
 ('1930', 'Chemistry'),
 ('1930', 'Literature'),
 ('1930', 'Peace'),
 ('1940', 'Literature'),
 ('1940', 'Medicine')]

In [9]:
# break the multi-index apart so we can create a dataframe which can be joined
decades = [dec_cat[i][0] for i in range(len(dec_cat))]
categories = [dec_cat[i][1] for i in range(len(dec_cat))]
df_fem_cat_decs = pd.DataFrame({"decade": decades, "category": categories, "count_women": fem_cat_decades.values})
print(df_fem_cat_decs.shape)
df_fem_cat_decs.head()

(38, 3)


Unnamed: 0,decade,category,count_women
0,1900,Literature,1
1,1900,Peace,1
2,1900,Physics,1
3,1910,Chemistry,1
4,1920,Literature,2
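
The MultiIndex can also be flattened without building the column lists by hand: `Series.reset_index(name=...)` turns each index level into a column and names the value column. A minimal sketch on invented rows (`toy`):

```python
import pandas as pd

# Invented rows shaped like df_fem_cat, for illustration only.
toy = pd.DataFrame({
    "decade": ["1900", "1900", "1920", "1920"],
    "category": ["Physics", "Peace", "Literature", "Literature"],
    "value": [1, 1, 1, 1],
})

# groupby on two keys returns a Series indexed by (decade, category).
counts = toy.groupby(["decade", "category"])["value"].sum()

# reset_index moves both index levels into ordinary columns.
flat = counts.reset_index(name="count_women")
print(flat)
```

The resulting frame has the columns `decade`, `category`, and `count_women`, so it can be passed to `pd.merge` with `on=["decade", "category"]` in the same way as the hand-built version.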


In [10]:
df_all_cat = df_nobel.loc[:, ["year", "category"]]
# create decade column
df_all_cat['decade'] = df_all_cat["year"].astype(str).str[:3] + "0"
df_all_cat['value'] = 1
df_all_cat.head()
all_cat_decades = df_all_cat.groupby(["decade", "category"])["value"].sum()
all_cat_decades.head()

decade  category  
1900    Chemistry      9
        Literature    10
        Medicine      11
        Peace         14
        Physics       13
Name: value, dtype: int64

In [11]:
dec_cat_all = all_cat_decades.index.to_list()
# dec_cat_all[0:10]

# break the multi-index apart so we can create a dataframe which can be joined
decades_all = [dec_cat_all[i][0] for i in range(len(dec_cat_all))]
categories_all = [dec_cat_all[i][1] for i in range(len(dec_cat_all))]
# print(len(decades_all), len(categories_all), len(all_cat_decades.values))
df_all_cat_decs = pd.DataFrame({"decade": decades_all, "category": categories_all, "count_all": all_cat_decades.values})
# print(df_all_cat_decs.shape)
df_all_cat_decs.head()

Unnamed: 0,decade,category,count_all
0,1900,Chemistry,9
1,1900,Literature,10
2,1900,Medicine,11
3,1900,Peace,14
4,1900,Physics,13


In [12]:
# Join the female and all winner dataframes together

df_female_all = pd.merge(df_fem_cat_decs, df_all_cat_decs, how="left", on=["decade", "category"])
df_female_all["prop_female"] = df_female_all["count_women"] / df_female_all["count_all"]
df_female_all.sort_values(by="prop_female", ascending=False, inplace=True)
# print(df_female_all.shape)
df_female_all.head()  # 2020, 0.5 in Literature

Unnamed: 0,decade,category,count_women,count_all,prop_female
34,2020,Literature,2,4,0.5
30,2010,Peace,5,14,0.357143
23,2000,Literature,3,10,0.3
18,1990,Literature,3,10,0.3
28,2010,Literature,3,10,0.3


In [13]:
df_max_female = df_female_all.iloc[:1, :]
max_female_decade = int(df_max_female.iloc[:1, 0].values[0])
max_female_category = df_max_female.iloc[:1, 1].values[0]
# print(max_female_decade, max_female_category)
max_female_dict = {max_female_decade : max_female_category}

## Who was the first woman to receive a Nobel Prize, and in what category?

In [14]:
# build the mask from df_nobel itself rather than from the earlier df_sex_cat copy
df_fem_year_cat_name = df_nobel.loc[df_nobel['sex'] == 'Female']
df_fem_sorted = df_fem_year_cat_name.loc[:, ['year', 'category', 'full_name']].sort_values("year")
df_first_woman = df_fem_sorted.iloc[:1, :]
df_first_woman

Unnamed: 0,year,category,full_name
19,1903,Physics,"Marie Curie, née Sklodowska"


In [15]:
first_woman_name = df_first_woman['full_name'].values[0]
first_woman_category = df_first_woman['category'].values[0]
print(first_woman_name, first_woman_category)

Marie Curie, née Sklodowska Physics
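
Sorting the whole frame is not strictly needed to find the earliest row. One alternative, sketched below on invented rows with placeholder names, is to look up the row label of the minimum year with `idxmin()`. If several women shared the earliest year, `idxmin()` would return only the first of them.

```python
import pandas as pd

# Invented rows with placeholder names, for illustration only.
toy = pd.DataFrame({
    "year": [1901, 1903, 1905, 1911],
    "category": ["Chemistry", "Physics", "Peace", "Chemistry"],
    "full_name": ["Laureate A", "Laureate B", "Laureate C", "Laureate D"],
    "sex": ["Male", "Female", "Female", "Female"],
})

women = toy[toy["sex"] == "Female"]

# idxmin() returns the index label of the smallest year among the women.
first = women.loc[women["year"].idxmin()]
first_name_toy = first["full_name"]
first_category_toy = first["category"]
print(first_name_toy, first_category_toy)
```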


## Which individuals or organizations have won more than one Nobel Prize throughout the years?

<s>The next couple of cells show that there are far more organization repeats than individual repeats (which makes sense intuitively) and that **University of California** has won the most Nobel Prizes.</s>

Only the `full_name` column is needed here (I originally thought I needed to look at `organization_name`, but that is not the case).

In [16]:
df_names_orgs = df_nobel.loc[:, ['full_name', 'organization_name']]
name_counts = df_names_orgs.value_counts(subset='full_name', ascending=False)
print(name_counts.shape[0])  # 993 unique names before removing singletons
names_more_than_once = name_counts.loc[lambda x: x > 1]
print(names_more_than_once.shape[0])
print(type(names_more_than_once))
names_list = names_more_than_once.index.to_list()
names_list[:10]

993
6
<class 'pandas.core.series.Series'>


['Comité international de la Croix Rouge (International Committee of the Red Cross)',
 'Office of the United Nations High Commissioner for Refugees (UNHCR)',
 'Frederick Sanger',
 'Linus Carl Pauling',
 'John Bardeen',
 'Marie Curie, née Sklodowska']

In [17]:
# org_counts = df_names_orgs.value_counts(subset='organization_name', ascending=False)
# print(org_counts.shape[0])  # 325 before filtering out singletons
# orgs_more_than_once = org_counts.loc[lambda x: x > 1]
# print(orgs_more_than_once.shape[0])  # 102 after filtering out singletons
# orgs_list = orgs_more_than_once.index.to_list()
# orgs_list[:10]

In [18]:
repeat_set = set(names_list)  # safeguard only: the value_counts() index is already unique
repeat_list = list(repeat_set)
print(len(repeat_list))
repeat_list

6


['Frederick Sanger',
 'Comité international de la Croix Rouge (International Committee of the Red Cross)',
 'John Bardeen',
 'Marie Curie, née Sklodowska',
 'Office of the United Nations High Commissioner for Refugees (UNHCR)',
 'Linus Carl Pauling']
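
The repeat-winner logic above comes down to counting names and keeping those that occur more than once. The compact sketch below shows this on invented placeholder names. Because the index of a `value_counts()` result contains each name only once, no extra de-duplication step is needed.

```python
import pandas as pd

# Invented placeholder names, for illustration only.
toy = pd.DataFrame({
    "full_name": ["Laureate A", "Org B", "Laureate A", "Laureate C", "Org B", "Laureate D"],
})

# Count how often each name appears.
name_counts_toy = toy["full_name"].value_counts()

# Keep only names that appear more than once; the index is already unique.
repeat_list_toy = name_counts_toy[name_counts_toy > 1].index.tolist()
print(sorted(repeat_list_toy))
```

Matching on `full_name` assumes each laureate's name is spelled identically across rows. Grouping on `laureate_id` instead would be a reasonable alternative if that assumption did not hold.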