## 1. The most Nobel of Prizes
<p>Alfred Nobel was a Swedish chemist, engineer, and industrialist best known for inventing dynamite. He died in 1896. In his will, he bequeathed all of his "remaining realisable assets" to establish five prizes, which became known as the "Nobel Prizes". They were first awarded in 1901. In 1968, a sixth prize was established in the field of Economic Sciences; however, it is not considered a "Nobel Prize" but a "Nobel Memorial Prize".
The Nobel Prize is not a single prize, but five separate prizes that, according to Alfred Nobel's 1895 will, are awarded "to those who, during the preceding year, have conferred the greatest benefit to humankind".
Nobel Prizes are awarded in the fields of <em>Physics, Chemistry, Physiology or Medicine, Literature, and Peace</em>. They are widely regarded as the most prestigious awards available in their respective fields.
The prize ceremonies take place annually. Each recipient (known as a "<em>Laureate</em>") receives a gold medal, a diploma, and a monetary award. A prize may not be shared among more than three individuals, although the Nobel Peace Prize can be awarded to organizations of more than three people. Nobel Prizes are not awarded posthumously, but if a person is awarded a prize and dies before receiving it, the prize is still presented.<br><em>Source: https://en.wikipedia.org/wiki/Nobel_Prize</em><br><p>The Nobel Foundation has made available a dataset of all prize winners from the start of the prize, in 1901, to 2016.</p>

In [None]:
# Loading in required libraries

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import squarify
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import dateutil

# Reading in the Nobel Prize data
nobel = pd.read_csv("../input/nobel-laureates/archive.csv")

### Making an assessment of the data

In [None]:
# Taking a look at the dataset information
print("The dataset information is listed below:")
nobel.info()

In [None]:
# Getting the number of rows and columns
print('There are '+ str(nobel.shape[0]) + ' rows and '+ str(nobel.shape[1]) +' columns')

In [None]:
# Renaming the columns, then listing the new column names
nobel.columns = ['Year', 'Category', 'Prize', 'Motivation', 'Prize_Share', 'Laureate ID',
       'Laureate_Type', 'Full Name', 'Birth_Date', 'Birth City',
       'Birth_Country', 'Sex', 'Organization Name', 'Organization City',
       'Organization Country', 'Death_Date', 'Death_City', 'Death_Country']
nobel.columns

In [None]:
# Get a summary of the dataset statistics
nobel.describe()

### Creating different datasets

In [None]:
PhysicsDF = nobel[(nobel.Category == 'Physics')]
ChemistryDF = nobel[(nobel.Category == 'Chemistry')]
MedicineDF = nobel[(nobel.Category == 'Medicine')]
LiteratureDF = nobel[(nobel.Category == 'Literature')]
PeaceDF = nobel[(nobel.Category == 'Peace')]
EconomicsDF = nobel[(nobel.Category == 'Economics')]

### Since when have Nobel Prizes been awarded?

In [None]:
first_year = min(nobel['Year'])
last_year = max(nobel['Year'])
n_years_since = last_year - first_year

n_dist_year = nobel['Year'].nunique()
n_dist_cat = nobel['Category'].nunique()
n_dist_lau = nobel['Laureate ID'].nunique()

n_no_award_year = last_year - first_year - n_dist_year + 1

dist_cat = nobel['Category'].nunique()
n_lau = nobel['Laureate ID'].value_counts()

In [None]:
print('First awarded in the year :', first_year)
print('Last awarded in the year :', last_year)
print('No. of years since first award :', n_years_since)
print('No. of unique years at least one award is given :', n_dist_year)
print('No. of unique years no award is given :', n_no_award_year)
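
<p>The count of years without an award is derived above by arithmetic. To list the years themselves, one possible sketch is shown here; <code>missing_years</code> is a hypothetical helper, and the toy values stand in for <code>nobel['Year']</code>.</p>

```python
import pandas as pd

def missing_years(years: pd.Series) -> list:
    """Return the years between the first and last award that have no rows at all."""
    awarded = {int(y) for y in years.dropna()}
    full_range = range(min(awarded), max(awarded) + 1)
    return [y for y in full_range if y not in awarded]

# Toy data standing in for nobel['Year'] (hypothetical values)
toy_years = pd.Series([1901, 1902, 1902, 1904, 1907])
print(missing_years(toy_years))
```

<p>On the real data this would be called as <code>missing_years(nobel['Year'])</code>.</p>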

## 2. Who gets the Nobel Prize?
<p>As the data shows, all of the winners in 1901 were European. How has this changed since then, what is the gender split of the winners, and which country is most commonly represented?</p>
<p>(For <em>country</em>, we will use the <em>Birth_Country</em> of the winner, as the <em>Organization Country</em> is <em>NaN</em> for all shared Nobel Prizes.)</p>

In [None]:
# Display the number of Nobel Prizes (possibly shared) handed out between 1901 and 2016
display(len(nobel))

# Display the number of prizes won by recipient's gender.
display(nobel['Sex'].value_counts())

# Display the number of prizes won by the top 12 nationalities.
nobel['Birth_Country'].value_counts().head(12)

In [None]:
# plotting a tree map

y = nobel['Birth_Country'].value_counts().head(12)

plt.rcParams['figure.figsize'] = (15, 15)
plt.style.use('fivethirtyeight')

color = plt.cm.summer(np.linspace(0, 1, 15))
squarify.plot(sizes = y.values, label = y.index, alpha=.8, color = color)
plt.title('Tree Map of number of prizes won by nationality', fontsize = 20, color="black")

plt.axis('off')
plt.show()

## 3. So, who gets the Nobel Prize in each Category?
<p>Below we look at each individual category.</p>

In [None]:
print(PhysicsDF['Birth_Country'].value_counts().head(10))

In [None]:
phydf=pd.crosstab(PhysicsDF.Birth_Country, PhysicsDF.Sex)

In [None]:
#phydf.head(10)
phydf.plot.bar(stacked=True)
plt.legend(title='Gender')
plt.show()

In [None]:
print(ChemistryDF['Birth_Country'].value_counts().head(10))

In [None]:
chedf=pd.crosstab(ChemistryDF.Birth_Country,ChemistryDF.Sex)

In [None]:
#chedf.head(10)
chedf.plot.bar(stacked=True)
plt.legend(title='Gender')
plt.show()

In [None]:
print(MedicineDF['Birth_Country'].value_counts().head(10))

In [None]:
meddf=pd.crosstab(MedicineDF.Birth_Country,MedicineDF.Sex)

In [None]:
#meddf.head(10)
meddf.plot.bar(stacked=True)
plt.legend(title='Gender')
plt.show()

In [None]:
print(LiteratureDF['Birth_Country'].value_counts().head(10))

In [None]:
litdf=pd.crosstab(LiteratureDF.Birth_Country,LiteratureDF.Sex)

In [None]:
#litdf.head(10)
litdf.plot.bar(stacked=True)
plt.legend(title='Gender')
plt.show()

In [None]:
print(PeaceDF['Birth_Country'].value_counts().head(10))

In [None]:
peadf=pd.crosstab(PeaceDF.Birth_Country,PeaceDF.Sex)

In [None]:
#peadf.head(10)
peadf.plot.bar(stacked=True)
plt.legend(title='Gender')
plt.show()

In [None]:
print(EconomicsDF['Birth_Country'].value_counts().head(10))

In [None]:
ecodf=pd.crosstab(EconomicsDF.Birth_Country,EconomicsDF.Sex)

In [None]:
#ecodf.head(10)
ecodf.plot.bar(stacked=True)
plt.legend(title='Gender')
plt.show()
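
<p>The six per-category blocks above repeat the same steps, and the bar charts include every birth country. A possible way to factor this out is sketched below; <code>country_by_sex</code> is a hypothetical helper (not part of the notebook) that keeps only the most frequent countries, and the toy rows are made up for illustration.</p>

```python
import pandas as pd

def country_by_sex(df: pd.DataFrame, category: str, top_n: int = 10) -> pd.DataFrame:
    """Cross-tabulate birth country against sex for one category, keeping the top_n countries."""
    subset = df[df['Category'] == category]
    top = subset['Birth_Country'].value_counts().head(top_n).index
    subset = subset[subset['Birth_Country'].isin(top)]
    table = pd.crosstab(subset['Birth_Country'], subset['Sex'])
    # Order rows from most to fewest laureates
    return table.loc[table.sum(axis=1).sort_values(ascending=False).index]

# Hypothetical toy rows using the notebook's column names
toy = pd.DataFrame({
    'Category': ['Physics', 'Physics', 'Physics', 'Chemistry'],
    'Birth_Country': ['France', 'France', 'Germany', 'Sweden'],
    'Sex': ['Female', 'Male', 'Male', 'Male'],
})
table = country_by_sex(toy, 'Physics')
print(table)
```

<p>With the real data, <code>country_by_sex(nobel, 'Physics').plot.bar(stacked=True)</code> would draw the same kind of stacked bar chart, limited to the ten most frequent birth countries.</p>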

## 4. USA tops the ranking
<p>The USA shows a clear dominance in the number of award winners; in fact, the only category where the USA is not on top is Literature. As mentioned earlier, all the winners in 1901 were European, so it is interesting to find out when the USA started to dominate the Nobel Prize charts.</p>

In [None]:
# Assigning each prize to its decade (for example, 1938 becomes 1930)
nobel['Decade'] = (np.floor(nobel['Year'] / 10) * 10).astype(int)
# Working out the proportion of USA born winners per decade
nobel['USA Born Winner'] = nobel['Birth_Country'] == 'United States of America'
prop_usa_winners = nobel.groupby('Decade', as_index=False)['USA Born Winner'].mean()

# Displays the proportions of USA born winners per decade
prop_usa_winners
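
<p>The decade binning and the per-decade share can be checked by hand on a few rows. This is a minimal sketch on hypothetical toy data; it uses integer floor division, which for integer years should give the same decades as the <code>np.floor</code> expression above.</p>

```python
import pandas as pd

# Hypothetical toy rows using the notebook's column names
toy = pd.DataFrame({
    'Year': [1901, 1905, 1932, 1935, 1938, 1939],
    'Birth_Country': ['Germany', 'France', 'United States of America',
                      'Germany', 'United States of America', 'Sweden'],
})

# Floor division maps each year to the first year of its decade
toy['Decade'] = (toy['Year'] // 10) * 10

# The mean of a boolean column is the proportion of True values
toy['USA Born Winner'] = toy['Birth_Country'] == 'United States of America'
share = toy.groupby('Decade')['USA Born Winner'].mean()
print(share)
```

<p>Because <code>True</code> counts as 1 and <code>False</code> as 0, the group mean is exactly the fraction of USA born winners in that decade.</p>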

<p>As the plot below shows, the USA started to dominate the Nobel charts in the 1930s, with about a quarter of the winners.</p>

In [None]:
# Setting the plotting theme
sns.set()
# and setting the size of all plots.
plt.rcParams['figure.figsize'] = [11, 7]

# Plotting USA born winners 
ax = sns.lineplot(x='Decade', y='USA Born Winner', data=prop_usa_winners)

# Adding %-formatting to the y-axis
from matplotlib.ticker import PercentFormatter
ax.yaxis.set_major_formatter(PercentFormatter(1.0))

## 5. What words are most frequently written in the prize motivation?
<p>A word cloud gives a quick visual impression of the words used in the prize motivation text.</p>

In [None]:
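# Joining all prize motivations into one string (missing values are skipped)
text = " ".join(nobel['Motivation'].dropna().astype(str))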
wordcloud = WordCloud(max_font_size=40, max_words=50, background_color="white").generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
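
<p>A word cloud is hard to read off precisely, so a plain frequency count can complement it. The sketch below uses only the standard library; <code>top_words</code>, the small <code>STOP</code> set, and the toy motivations are all assumptions made for illustration (the <code>wordcloud</code> package ships its own, larger <code>STOPWORDS</code> set).</p>

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list used only for this illustration
STOP = {'the', 'of', 'for', 'in', 'and', 'his', 'her', 'their', 'to', 'on', 'a'}

def top_words(texts, n=5):
    """Count lowercase words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", str(text).lower())
        counts.update(w for w in words if w not in STOP)
    return counts.most_common(n)

# Made-up motivation strings standing in for nobel['Motivation']
toy_motivations = [
    "for his discovery of the law of x",
    "for their discovery of radiation",
    "for her work on radiation phenomena",
]
print(top_words(toy_motivations, n=2))
```

<p>On the real data, passing <code>nobel['Motivation'].dropna()</code> avoids counting missing values as the word "nan".</p>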

## 6. Gender of Nobel Prize Laureates
<p>As we saw, the USA has dominated the ranking of Nobel Prize Laureates since the 1930s. Male Laureates have also dominated since 1901. The gender imbalance among Laureates is large, and it differs considerably between prize categories.</p>

In [None]:
# Proportion of female Laureates per decade
nobel['Female Winner'] = nobel['Sex'] == 'Female'
prop_female_winners = nobel.groupby(['Decade', 'Category'], as_index=False)['Female Winner'].mean()

# Plotting female winners per decade and category, with % winners on the y-axis
ax = sns.lineplot(x='Decade', y='Female Winner', hue='Category', data=prop_female_winners)
ax.yaxis.set_major_formatter(PercentFormatter(1.0))
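
<p>One detail worth checking is how rows without a recorded sex affect these proportions. The sketch below assumes that such rows exist (for example, organizations); the toy data is hypothetical. It compares the share computed from a boolean column, as above, with the share among rows that have a recorded sex.</p>

```python
import pandas as pd

# Hypothetical toy rows; the last Peace row has no recorded sex
toy = pd.DataFrame({
    'Category': ['Physics', 'Physics', 'Physics', 'Physics', 'Peace', 'Peace', 'Peace'],
    'Sex': ['Male', 'Male', 'Male', 'Female', 'Female', 'Male', None],
})

# Share computed from a boolean column: rows with a missing Sex count as "not female"
share_all_rows = (toy['Sex'] == 'Female').groupby(toy['Category']).mean()

# Share among rows with a recorded Sex: crosstab leaves out the missing values
share_recorded = pd.crosstab(toy['Category'], toy['Sex'], normalize='index')['Female']

print(share_all_rows)
print(share_recorded)
```

<p>In the toy data the two measures agree for Physics but differ for Peace, so the choice of denominator matters mostly in categories with many rows that lack a recorded sex.</p>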

## 7. First female to win the Nobel Prize
<p>The plot above shows some interesting trends and patterns. Overall, the imbalance is notably large in <em>Physics, Economics, and Chemistry</em>. Medicine shows a slightly positive trend. Since the 1990s, Literature has also become more balanced. The strongest positive trend is shown by the Peace Prize in the 2010s; however, that decade only covers 7 years of data (2010 to 2016).</p>
<p>The first woman to receive a Nobel Prize, and the category she won it in, are shown below.</p>

In [None]:
# First female to win a Nobel Prize
nobel[nobel.Sex == 'Female'].nsmallest(1, 'Year')

## 8. Repeat laureates
<p>For most scientists, writers, and activists, a Nobel Prize would be the crowning achievement of a long career. There are, however, Laureates who have received the Nobel Prize on more than one occasion. They are listed below.</p>

In [None]:
# Selecting the Laureates that have received prizes in 2 or more different years.
nobel.groupby('Laureate ID').filter(lambda group: group['Year'].nunique() >= 2)

<p>As shown, Marie Curie received the prize in <em>Physics</em> for her research on radiation and in <em>Chemistry</em> for isolating radium and polonium. John Bardeen also received it twice in <em>Physics</em>, for transistors and superconductivity; Frederick Sanger received it twice in <em>Chemistry</em>; and Linus Carl Pauling received it first in <em>Chemistry</em> and later in <em>Peace</em> for his work promoting nuclear disarmament. Organizations can also be repeat winners: both the Red Cross and the UNHCR have received the prize more than once.</p>
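
<p>Selecting repeat laureates depends on counting prizes rather than rows. The sketch below assumes that one prize may appear on several rows (for example, one row per affiliation), which is an assumption about the data rather than something verified here; <code>repeat_laureates</code> and the toy rows are hypothetical.</p>

```python
import pandas as pd

# Hypothetical toy rows: laureate 2 appears twice for the same prize
toy = pd.DataFrame({
    'Laureate ID': [1, 1, 2, 2, 3],
    'Year': [1903, 1911, 1950, 1950, 1960],
    'Category': ['Physics', 'Chemistry', 'Medicine', 'Medicine', 'Peace'],
})

def repeat_laureates(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per prize for laureates holding two or more distinct prizes."""
    # Collapse duplicate rows that describe the same prize
    prizes = df.drop_duplicates(subset=['Laureate ID', 'Year', 'Category'])
    counts = prizes.groupby('Laureate ID').size()
    repeat_ids = counts[counts >= 2].index
    return prizes[prizes['Laureate ID'].isin(repeat_ids)]

result = repeat_laureates(toy)
print(result)
```

<p>In the toy data only laureate 1 is returned: laureate 2 has two rows but a single prize.</p>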

## 9. At what age did the Laureates receive the Nobel Prize?
<p>How old are/were the Laureates when they received the Nobel Prize? This is shown below.</p>

In [None]:
# Converting birth_date from String to datetime
nobel['Birth_Date'] = pd.to_datetime(nobel['Birth_Date'], errors='coerce')

# Obtaining the age of Nobel Prize winners
nobel['Age'] = nobel['Year'] - nobel['Birth_Date'].dt.year

# Plotting the age of Nobel Prize winners
sns.lmplot(x='Year', y='Age', data=nobel, aspect=2, line_kws={'color' : 'black'})

<p>As we see in the plot above, up to the 1940s Nobel Prize Laureates were 55 years old on average when they received their Prize. Nowadays that average is closer to 65 years old. However, there is a large spread in the Laureates' ages: although the majority are 50+, some are quite young.</p>
<p>The scatter in the plot is also denser in recent years than in the early 1900s, partly because more of the prizes are now shared, resulting in more winners. A disruption in the awards is also visible from 1939 to 1945, due to the Second World War.</p>
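
<p>The averages read off the regression line can be cross-checked with a per-decade mean. The following is a minimal sketch on hypothetical toy rows; it repeats the age calculation used above and shows that an unparseable birth date becomes a missing age rather than an error.</p>

```python
import pandas as pd

# Hypothetical toy rows; the last birth date cannot be parsed
toy = pd.DataFrame({
    'Year': [1905, 1908, 2011, 2014],
    'Birth_Date': ['1850-03-01', '1858-07-15', '1941-01-20', 'unknown'],
})

# errors='coerce' turns unparseable dates into NaT instead of raising
toy['Birth_Date'] = pd.to_datetime(toy['Birth_Date'], errors='coerce')

# Award year minus birth year: an approximation that ignores the exact birthday
toy['Age'] = toy['Year'] - toy['Birth_Date'].dt.year

# Mean age per decade; missing ages are skipped by mean()
toy['Decade'] = (toy['Year'] // 10) * 10
mean_age = toy.groupby('Decade')['Age'].mean()
print(mean_age)
```

<p>Since the age is the award year minus the birth year, it can be off by up to one year for any individual Laureate.</p>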

## 10. Age differences between prize categories
<p>Age trends in the different prize categories are shown below.</p>

In [None]:
# Similar plot as above, but with separate plots for each Nobel Prize category
sns.lmplot(x='Year', y='Age', row='Category', data=nobel, aspect=2, line_kws={'color' : 'black'})

<p>From the plots above, we can see that Laureates of the <em>Physics, Chemistry and Medicine</em> prizes have gotten older over time. The trend is most visible in <em>Physics</em>, where the average age was below 50 years old and is now almost 70. The Economics category was only introduced in 1968. The opposite trend is observed in the <em>Peace</em> category, where Laureates are receiving their Prizes at a younger age.</p>
<p>In 2014 an exceptionally young Laureate received the <em>Nobel Peace Prize</em>.</p>

## 11. Oldest and youngest Laureates
<p>The oldest and youngest Laureates ever to have won a Nobel Prize are shown below.</p>

In [None]:
# The oldest Nobel Prize Laureate as of 2016 is
display(nobel.nlargest(1, 'Age'))

# The youngest Nobel Prize Laureate as of 2016 is
nobel.nsmallest(1, 'Age')

<p><em>Leonid Hurwicz</em> was 90 years old when he received the <em>Nobel Prize</em>.</p>
<p><em>Malala Yousafzai</em> was 17 years old when she received the <em>Nobel Prize</em>.</p>

## 12. Organizations with the largest number of Nobel Prizes
<p>The Organizations with the largest number of Nobel Prizes are shown below.</p>

In [None]:
c = nobel['Organization Name'].value_counts()
plt.figure(figsize=(5,12))
UniversitiesGraph = sns.countplot(y="Organization Name", data=nobel,
              order=c.nlargest(30).index,
              palette='rocket')
plt.show()

## 13. Final comments
<p>Thanks for looking at my notebook.</p>