# Setup and Context

### Introduction

On November 27, 1895, Alfred Nobel signed his last will in Paris. When it was opened after his death, the will caused a lot of controversy, as Nobel had left much of his wealth for the establishment of a prize.

Alfred Nobel dictated that his entire remaining estate should be used to endow “prizes to those who, during the preceding year, have conferred the greatest benefit to humankind”.

Every year the Nobel Prize is awarded to scientists and scholars in the categories of chemistry, literature, physics, physiology or medicine, economics, and peace.

<img src=https://i.imgur.com/36pCx5Q.jpg>

Let's see what patterns we can find in the data of the past Nobel laureates. What can we learn about the Nobel Prize and our world more generally?

### Upgrade plotly (only Google Colab Notebook)

Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server. 

In [1]:
# %pip install --upgrade plotly
%config Completer.use_jedi = False

### Import Statements

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

### Notebook Presentation

In [None]:
pd.options.display.float_format = '{:,.2f}'.format

### Read the Data

In [None]:
df_data = pd.read_csv('nobel_prize_data.csv')

Caveats: The exact birthdays of Michael Houghton, Venkatraman Ramakrishnan, and Nadia Murad are unknown. I've substituted them with a mid-year estimate of July 2nd.


# Data Exploration & Cleaning

**Challenge**: Preliminary data exploration. 
* What is the shape of `df_data`? How many rows and columns?
* What are the column names?
* In which year was the Nobel Prize first awarded?
* Which year is the latest year included in the dataset?

In [None]:
df_data.shape

In [None]:
df_data.columns
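
The two questions about years can be answered with `min()` and `max()` on the `year` column. Here is a minimal sketch on a made-up three-row frame; the values are illustrative only and are not taken from `nobel_prize_data.csv`.

```python
import pandas as pd

# Hypothetical stand-in for df_data; the rows are illustrative only
toy = pd.DataFrame({
    'year': [1901, 1950, 2020],
    'category': ['Physics', 'Peace', 'Chemistry'],
})

n_rows, n_cols = toy.shape        # shape is a (rows, columns) tuple
first_year = toy['year'].min()    # earliest year in the frame
latest_year = toy['year'].max()   # latest year in the frame
print(n_rows, n_cols, first_year, latest_year)
```

Running the same two calls on `df_data['year']` before the datetime conversion gives the first and latest award years in the real dataset.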

**Challenge**:
* Are there any duplicate values in the dataset?
* Are there NaN values in the dataset?
* Which columns tend to have NaN values?
* How many NaN values are there per column? 
* Why do these columns have NaN values?  

### Check for Duplicates

In [None]:
df_data.duplicated().any()

In [None]:
df_data[df_data.duplicated()]

### Check for NaN Values

In [None]:
df_data.isna().any()

In [None]:
df_data.isna().sum()

In [None]:
df_data[df_data['birth_date'].isna()]

In [None]:
df_data[df_data['organization_name'].isna()]

In [None]:
df_data[df_data['organization_city'].isna()]

### Type Conversions

**Challenge**: 
* Convert the `birth_date` column to Pandas `Datetime` objects
* Add a column called `share_pct` that holds each laureate's share as a percentage, stored as a floating-point number.

#### Convert Year and Birth Date to Datetime

In [None]:
df_data['year'] = pd.to_datetime(df_data['year'], format='%Y')
df_data['birth_date'] = pd.to_datetime(df_data['birth_date'])

#### Add a Column with the Prize Share as a Percentage

In [None]:
df_data['share_pct'] = df_data['share'].astype(float) * 100
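
The cast above only works if `share` already holds numbers. If the column instead stores fractions as text, such as `1/2` (an assumption about the file, not something confirmed here), `astype(float)` raises an error. A possible workaround is to split the text into numerator and denominator first. The sketch below uses made-up values.

```python
import pandas as pd

# Hypothetical share values stored as text fractions (an assumption about the file)
toy = pd.DataFrame({'share': ['1/1', '1/2', '1/4']})

# Split "numerator/denominator" into two columns and convert each to a number
parts = toy['share'].str.split('/', expand=True)
numerator = pd.to_numeric(parts[0])
denominator = pd.to_numeric(parts[1])

# Share of the prize as a floating-point percentage
toy['share_pct'] = numerator / denominator * 100
print(toy)
```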

In [None]:
df_data.head()

# Plotly Donut Chart: Percentage of Male vs. Female Laureates

**Challenge**: Create a [donut chart using plotly](https://plotly.com/python/pie-charts/) which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?

In [None]:
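# use a donut chart to compare how many men vs. women won the Nobel Prize
# (assumes the laureates' gender is stored in a column named 'sex')
sex_counts = df_data['sex'].value_counts()
fig = px.pie(values=sex_counts.values, names=sex_counts.index, hole=0.4)
fig.update_traces(textinfo='percent')
fig.show()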

In [None]:
# what percentage of all the prizes went to women (assumes a 'sex' column)
df_data['sex'].value_counts(normalize=True) * 100

# Who were the first 3 Women to Win the Nobel Prize?

**Challenge**: 
* What are the names of the first 3 female Nobel laureates? 
* What did they win the prize for? 
* What do you see in their `birth_country`? Were they part of an organisation?

In [None]:
# what are the names of the first 3 female nobel laureates
# (assumes the laureates' gender is stored in a column named 'sex' with the value 'Female')
first_women = df_data[df_data['sex'] == 'Female'].sort_values('year').head(3)
print(first_women['fullname'])

In [None]:
# what did they win the prize for
print(first_women[['category', 'prize_name']])

In [None]:
# what do you see in their birth_country?
print(first_women['birth_country'])

In [None]:
# were they part of an organization
print(first_women['organization_name'])

# How Many People Have Won More than One Prize?

**Challenge**: How many people have won more than one prize?
* Hint: Use `value_counts()`

In [None]:
# how many people have won more than one prize
(df_data['fullname'].value_counts() > 1).sum()

# Find the Repeat Winners

**Challenge**: Did some people get a Nobel Prize more than once? If so, who were they? 

In [None]:
# did some people get a nobel prize more than once? if so, who were they
df_data[df_data.duplicated(subset='fullname', keep=False)][['year', 'category', 'fullname']]

In [None]:
df_data['prize_name'].value_counts()
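
The `value_counts()` hint becomes an answer once the counts are filtered. A small sketch with placeholder names (the laureates and years below are made up):

```python
import pandas as pd

# Hypothetical laureates; names and years are placeholders
toy = pd.DataFrame({
    'fullname': ['Laureate A', 'Laureate B', 'Laureate A',
                 'Laureate C', 'Laureate B', 'Laureate A'],
    'year': [1903, 1905, 1911, 1920, 1930, 1950],
})

# Count prizes per name, then keep only names that appear more than once
counts = toy['fullname'].value_counts()
repeat_winners = counts[counts > 1]
n_repeat = len(repeat_winners)

# All rows belonging to the repeat winners, grouped together for reading
repeat_rows = (
    toy[toy['fullname'].isin(repeat_winners.index)]
    .sort_values(['fullname', 'year'])
)
print(n_repeat)
print(repeat_rows)
```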

# Number of Prizes per Category

**Challenge**: 
* In how many categories are prizes awarded? 
* Create a plotly bar chart with the number of prizes awarded by category. 
* Use the color scale called `Aggrnyl` to colour the chart, but don't show a color axis.
* Which category has the largest number of prizes awarded? 
* Which category has the smallest number of prizes awarded? 

In [None]:
# in how many categories are prizes awarded
df_data['category'].nunique()

In [None]:
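# create a plotly bar chart with the number of prizes awarded by category
prizes_per_category = df_data['category'].value_counts()
fig = px.bar(x=prizes_per_category.index, y=prizes_per_category.values, color=prizes_per_category.values, color_continuous_scale='Aggrnyl')
fig.update_layout(coloraxis_showscale=False)  # hide the colour axis
fig.show()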

In [None]:
# which category has the most number of prizes awarded
df_data['category'].value_counts()

**Challenge**: 
* When was the first prize in the field of Economics awarded?
* Who did the prize go to?

In [None]:
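# when was the first prize in the field of economics awarded
# (assumes the category is labelled 'Economics' in the data)
economics = df_data[df_data['category'] == 'Economics']
economics['year'].min()

In [None]:
# who did the prize go to?
df_data[(df_data['category'] == 'Economics') & (df_data['year'] == df_data.loc[df_data['category'] == 'Economics', 'year'].min())]['fullname']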

In [None]:
df_data['prize_name'].value_counts()

# Male and Female Winners by Category

**Challenge**: Create a [plotly bar chart](https://plotly.com/python/bar-charts/) that shows the split between men and women by category. 
* Hover over the bar chart. How many prizes went to women in Literature compared to Physics?

<img src=https://i.imgur.com/od8TfOp.png width=650>

In [None]:
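# create a plotly bar chart that shows the split between men and women by category
# (assumes the laureates' gender is stored in a column named 'sex')
cat_men_women = df_data.groupby(['category', 'sex']).size().reset_index(name='count')

In [None]:
fig = px.bar(cat_men_women, x='category', y='count', color='sex')
fig.show()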

# Number of Prizes Awarded Over Time

**Challenge**: Are more prizes awarded recently than when the prize was first created? Show the trend in awards visually. 
* Count the number of prizes awarded every year. 
* Create a 5-year rolling average of the number of prizes (Hint: see previous lessons analysing Google Trends).
* Using Matplotlib superimpose the rolling average on a scatter plot.
* Show a tick mark on the x-axis for every 5 years from 1900 to 2020. (Hint: you'll need to use NumPy). 

<img src=https://i.imgur.com/4jqYuWC.png width=650>

* Use the [named colours](https://matplotlib.org/3.1.0/gallery/color/named_colors.html) to draw the data points in `dodgerblue` while the rolling average is coloured in `crimson`. 

<img src=https://i.imgur.com/u3RlcJn.png width=350>

* Looking at the chart, did the first and second world wars have an impact on the number of prizes being given out? 
* What could be the reason for the trend in the chart?


In [None]:
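# count the number of prizes awarded every year
prize_per_year = df_data.groupby(df_data['year'].dt.year).size()

In [None]:
# create a 5 year rolling average of the number of prizes
moving_average = prize_per_year.rolling(window=5).mean()

In [None]:
# using matplotlib superimpose the rolling average on a scatter plot
# and show a tick mark on the x-axis for every 5 years from 1900 to 2020
plt.figure(figsize=(16, 8))
plt.xticks(ticks=np.arange(1900, 2021, step=5), rotation=45)
plt.scatter(prize_per_year.index, prize_per_year.values, c='dodgerblue', alpha=0.7)
plt.plot(moving_average.index, moving_average.values, c='crimson', linewidth=3)
plt.show()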
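
To see what the 5-year rolling average and the NumPy tick positions look like in isolation, here is a minimal sketch on made-up data. It assumes `year` is a plain integer in the toy frame, which differs from the datetime column in `df_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical award years, one row per prize (integers for simplicity)
toy = pd.DataFrame({
    'year': [1901, 1901, 1902, 1903, 1903, 1903, 1904, 1905, 1905, 1906],
})

# Number of prizes per year: 1901 -> 2, 1902 -> 1, 1903 -> 3, ...
prizes_per_year = toy.groupby('year').size()

# The first four values are NaN because a full 5-year window is not yet available
rolling_avg = prizes_per_year.rolling(window=5).mean()

# Tick positions every 5 years from 1900 to 2020 (the stop value is exclusive)
ticks = np.arange(1900, 2021, step=5)

print(rolling_avg)
print(ticks)
```

The stop value is `2021` rather than `2020` so that the final tick at 2020 is included.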

# Are More Prizes Shared Than Before?

**Challenge**: Investigate if more prizes are shared than before. 

* Calculate the average prize share of the winners on a year by year basis.
* Calculate the 5-year rolling average of the percentage share.
* Copy-paste the cell from the chart you created above.
* Modify the code to add a secondary axis to your Matplotlib chart.
* Plot the rolling average of the prize share on this chart. 
* See if you can invert the secondary y-axis to make the relationship even more clear. 

In [None]:
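# calculate the average prize share of the winners on a year by year basis
yearly_avg_share = df_data.groupby(df_data['year'].dt.year)['share_pct'].mean()

In [None]:
# calculate the 5 year rolling average of the percentage share
share_moving_average = yearly_avg_share.rolling(window=5).mean()

In [None]:
# copy of the chart created above, with the rolling average of the prize share
# plotted on an inverted secondary y-axis
prize_per_year = df_data.groupby(df_data['year'].dt.year).size()
moving_average = prize_per_year.rolling(window=5).mean()
fig, ax1 = plt.subplots(figsize=(16, 8))
ax2 = ax1.twinx()  # secondary y-axis that shares the x-axis
ax2.invert_yaxis()
ax1.set_xticks(np.arange(1900, 2021, step=5))
ax1.scatter(prize_per_year.index, prize_per_year.values, c='dodgerblue', alpha=0.7)
ax1.plot(moving_average.index, moving_average.values, c='crimson', linewidth=3)
ax2.plot(share_moving_average.index, share_moving_average.values, c='grey', linewidth=3)
plt.show()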

# The Countries with the Most Nobel Prizes

**Challenge**: 
* Create a Pandas DataFrame called `top20_countries` that has two columns. The `prize` column should contain the total number of prizes won. 

<img src=https://i.imgur.com/6HM8rfB.png width=350>

* Is it best to use `birth_country`, `birth_country_current` or `organization_country`? 
* What are some potential problems when using `birth_country` or any of the others? Which column is the least problematic? 
* Then use plotly to create a horizontal bar chart showing the number of prizes won by each country. Here's what you're after:

<img src=https://i.imgur.com/agcJdRS.png width=750>

* What is the ranking for the top 20 countries in terms of the number of prizes?

In [None]:
# create a pandas dataframe called top20_countries that has the two columns
top_countries = df_data.groupby('birth_country_current', as_index=False).agg(prize=('category', 'count'))
top20_countries = top_countries.sort_values('prize').tail(20)
top20_countries

In [None]:
# use birth_country, birth_country_current or organization_country
# what are some potential problems when using birth_country or any of the others
# compare the number of distinct and missing values in each candidate column
candidates = df_data[['birth_country', 'birth_country_current', 'organization_country']]
pd.DataFrame({'unique': candidates.nunique(), 'missing': candidates.isna().sum()})

In [None]:
# use plotly to create a horizontal bar chart showing the number of prizes won by each country
fig = px.bar(top20_countries, x='prize', y='birth_country_current', orientation='h', color='prize')
fig.show()
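
The two-column ranking described in this challenge can be built with a named aggregation. A minimal sketch on placeholder data (the country labels are made up; only the column name `birth_country_current` comes from the challenge text):

```python
import pandas as pd

# Hypothetical rows, one per prize; country labels are placeholders
toy = pd.DataFrame({
    'birth_country_current': ['Country A', 'Country B', 'Country A',
                              'Country C', 'Country A', 'Country B'],
    'category': ['Physics', 'Peace', 'Chemistry',
                 'Peace', 'Physics', 'Literature'],
})

# Count prizes per country, rank them, and keep at most the top 20
top_countries = (
    toy.groupby('birth_country_current', as_index=False)
       .agg(prize=('category', 'count'))
       .sort_values('prize', ascending=False)
       .head(20)
)
print(top_countries)
```

For a horizontal bar chart it can help to sort ascending instead, since plotly draws the first row at the bottom of the y-axis.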

# Use a Choropleth Map to Show the Number of Prizes Won by Country

* Create this choropleth map using [the plotly documentation](https://plotly.com/python/choropleth-maps/):

<img src=https://i.imgur.com/s4lqYZH.png>

* Experiment with [plotly's available colours](https://plotly.com/python/builtin-colorscales/). I quite like the sequential colour `matter` on this map. 

Hint: You'll need to use a 3-letter country code for each country.


In [None]:
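# use a 3 letter country code for each country
# (assumes the dataset has a column named 'ISO' that holds these codes)
df_countries = df_data.groupby(['birth_country_current', 'ISO'], as_index=False).agg(prize=('category', 'count'))

In [None]:
# experiment with plotly's available colours (each scale is a list of colour strings)
px.colors.sequential.matter

In [None]:
# use a choropleth map to show the number of prizes won by country
fig = px.choropleth(df_countries, locations='ISO', color='prize', hover_name='birth_country_current', color_continuous_scale=px.colors.sequential.matter)
fig.show()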

# In Which Categories are the Different Countries Winning Prizes? 

**Challenge**: See if you can divide up the plotly bar chart you created above to show which categories made up the total number of prizes. Here's what you're aiming for:

<img src=https://i.imgur.com/iGaIKCL.png>

* In which category are Germany and Japan the weakest compared to the United States?
* In which category does Germany have more prizes than the UK?
* In which categories does France have more prizes than Germany?
* Which category makes up most of Australia's Nobel Prizes?
* Which category makes up half of the prizes in the Netherlands?
* Does the United States have more prizes in Economics than all of France? What about in Physics or Medicine?


The hard part is preparing the data for this chart! 


*Hint*: Take a two-step approach. The first step is grouping the data by country and category. Then you can create a DataFrame that looks something like this:

<img src=https://i.imgur.com/VKjzKa1.png width=450>
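
The two-step approach from the hint can be sketched on a made-up frame. The country labels and the column names `cat_prize` and `total_prize` are illustrative assumptions, not names required by the challenge.

```python
import pandas as pd

# Hypothetical rows, one per prize; country labels are placeholders
toy = pd.DataFrame({
    'birth_country_current': ['Country A', 'Country A', 'Country A',
                              'Country B', 'Country B'],
    'category': ['Physics', 'Physics', 'Peace', 'Peace', 'Chemistry'],
})

# Step 1: number of prizes per country and category
cat_country = (
    toy.groupby(['birth_country_current', 'category'])
       .size()
       .reset_index(name='cat_prize')
)

# Step 2: total prizes per country, merged back onto every category row
totals = (
    toy.groupby('birth_country_current')
       .size()
       .reset_index(name='total_prize')
)
merged = (
    cat_country.merge(totals, on='birth_country_current')
               .sort_values(['total_prize', 'cat_prize'], ascending=False)
)
print(merged)
```

Keeping the country total on every row makes it possible to sort the stacked bars by overall rank while colouring them by category.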


In [None]:
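# create a dataframe that looks something like this: one row per country and category
cat_country = df_data.groupby(['birth_country_current', 'category']).size().reset_index(name='cat_prize')

In [None]:
# keep only the top 20 countries and add their total number of prizes
total_prize = df_data['birth_country_current'].value_counts().head(20).rename('total_prize')
cat_country = cat_country.merge(total_prize, left_on='birth_country_current', right_index=True).sort_values('total_prize')

In [None]:
# In which category are Germany and Japan the weakest compared to the United States?
fig = px.bar(cat_country, x='cat_prize', y='birth_country_current', color='category', orientation='h')
fig.show()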

### Number of Prizes Won by Each Country Over Time

* When did the United States eclipse every other country in terms of the number of prizes won? 
* Which country or countries were leading previously?
* Calculate the cumulative number of prizes won by each country in every year. Again, use the `birth_country_current` of the winner to calculate this. 
* Create a [plotly line chart](https://plotly.com/python/line-charts/) where each country is a coloured line. 
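
One way to get a running total per country is to count prizes per country and year, then apply `cumsum()` within each country. A minimal sketch on placeholder data, with `year` as a plain integer for simplicity:

```python
import pandas as pd

# Hypothetical rows, one per prize; country labels are placeholders
toy = pd.DataFrame({
    'birth_country_current': ['Country A', 'Country B', 'Country A',
                              'Country A', 'Country B'],
    'year': [1901, 1901, 1902, 1902, 1903],
})

# Prizes per country per year
prize_by_year = toy.groupby(['birth_country_current', 'year']).size()

# Running total within each country, flattened back into columns
cumulative = (
    prize_by_year.groupby(level='birth_country_current')
                 .cumsum()
                 .reset_index(name='prize')
)
print(cumulative)
```

The resulting frame has one row per country and year, which is the long format that a line chart with one coloured line per country expects.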

In [None]:
# calculate the cumulative number of prizes won by each country in every year
prize_by_year = df_data.groupby(['birth_country_current', df_data['year'].dt.year]).size()
cumulative_prizes = prize_by_year.groupby(level=0).cumsum().reset_index(name='prize')

In [None]:
# create a plotly line chart where each country is a coloured line
fig = px.line(cumulative_prizes, x='year', y='prize', color='birth_country_current', hover_name='birth_country_current')
fig.show()

In [None]:
# when did the United States eclipse every other country in terms of the number of prizes won
# which country or countries were leading previously
# (leading country per year; years without a new prize carry the previous total forward)
cumulative_prizes.pivot(index='year', columns='birth_country_current', values='prize').ffill().idxmax(axis=1)

# What are the Top Research Organisations?

**Challenge**: Create a bar chart showing the organisations affiliated with the Nobel laureates. It should look something like this:

<img src=https://i.imgur.com/zZihj2p.png width=600>

* Which organisations make up the top 20?
* How many Nobel Prize winners are affiliated with the University of Chicago and Harvard University?

In [None]:
# which organisations make up the top 20
top20_orgs = df_data['organization_name'].value_counts().head(20).sort_values()

In [None]:
# create a bar chart showing the organisations affiliated with the Nobel laureates
fig = px.bar(x=top20_orgs.values, y=top20_orgs.index, orientation='h', color=top20_orgs.values)
fig.show()

In [None]:
# how many Nobel Prize winners are affiliated with the University of Chicago and Harvard University
top20_orgs[top20_orgs.index.isin(['University of Chicago', 'Harvard University'])]

# Which Cities Make the Most Discoveries? 

Where do major discoveries take place?  

**Challenge**: 
* Create another plotly bar chart graphing the top 20 organisation cities of the research institutions associated with a Nobel laureate. 
* Where is the number one hotspot for discoveries in the world?
* Which city in Europe has had the most discoveries?

In [None]:
# create a plotly bar chart graphing the top 20 organisation cities of the research institutions associated with a Nobel laureate
top20_org_cities = df_data['organization_city'].value_counts().head(20).sort_values()
fig = px.bar(x=top20_org_cities.values, y=top20_org_cities.index, orientation='h', color=top20_org_cities.values)
fig.show()

In [None]:
# where is the number one hotspot for discoveries in the world
top20_org_cities.idxmax()

In [None]:
# which city in Europe has had the most discoveries (read it off the ranked list)
top20_org_cities.sort_values(ascending=False)

# Where are Nobel Laureates Born? Chart the Laureate Birth Cities 

**Challenge**: 
* Create a plotly bar chart graphing the top 20 birth cities of Nobel laureates. 
* Use a named colour scale called `Plasma` for the chart.
* What percentage of the United States prizes came from Nobel laureates born in New York? 
* How many Nobel laureates were born in London, Paris and Vienna? 
* Out of the top 5 cities, how many are in the United States?


In [None]:
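# create a plotly bar chart graphing the top 20 birth cities of Nobel laureates
top20_birth_cities = df_data['birth_city'].value_counts().head(20).sort_values()
fig = px.bar(x=top20_birth_cities.values, y=top20_birth_cities.index, orientation='h', color=top20_birth_cities.values, color_continuous_scale='Plasma')
fig.show()

In [None]:
# what percentage of the United States prizes came from Nobel laureates born in New York
# (assumes the labels 'United States of America' and 'New York, NY' are the ones used in the data)
us_born = df_data[df_data['birth_country_current'] == 'United States of America']
(us_born['birth_city'] == 'New York, NY').mean() * 100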

# Plotly Sunburst Chart: Combine Country, City, and Organisation

**Challenge**: 

* Create a DataFrame that groups the number of prizes by organisation. 
* Then use the [plotly documentation to create a sunburst chart](https://plotly.com/python/sunburst-charts/)
* Click around in your chart, what do you notice about Germany and France? 


Here's what you're aiming for:

<img src=https://i.imgur.com/cemX4m5.png width=300>



In [None]:
# create a DataFrame that groups the number of prizes by organisation
country_city_org = df_data.groupby(['organization_country', 'organization_city', 'organization_name'], as_index=False).agg(prize=('category', 'count'))

In [None]:
# use the plotly documentation to create a sunburst chart
fig = px.sunburst(country_city_org, path=['organization_country', 'organization_city', 'organization_name'], values='prize')
fig.show()

In [None]:
# click around in your chart, what do you notice about Germany and France

# Patterns in the Laureate Age at the Time of the Award

How Old Are the Laureates When They Win the Prize?

**Challenge**: Calculate the age of the laureate in the year of the ceremony and add this as a column called `winning_age` to the `df_data` DataFrame. Hint: you can use [this](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html) to help you. 



In [None]:
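# calculate the age of the laureate in the year of the ceremony and add this as a column called winning_age to the df_data DataFrame
# subtract the birth year from the award year; subtracting the raw dates would give a Timedelta in days
df_data['winning_age'] = df_data['year'].dt.year - pd.to_datetime(df_data['birth_date']).dt.year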
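
Subtracting one datetime column from another yields a `Timedelta` (a duration in days), not an age in years. The `.dt` accessor linked in the hint gives access to the year component instead. A minimal sketch with made-up dates:

```python
import pandas as pd

# Hypothetical award years and birth dates, both stored as datetimes
toy = pd.DataFrame({
    'year': pd.to_datetime(['1901', '1950'], format='%Y'),
    'birth_date': pd.to_datetime(['1850-03-15', '1900-07-02']),
})

# Raw subtraction: a duration, which is awkward to plot or average as an age
as_timedelta = toy['year'] - toy['birth_date']

# Year-based subtraction: the age reached in the year of the ceremony
toy['winning_age'] = toy['year'].dt.year - toy['birth_date'].dt.year
print(as_timedelta.dtype)
print(toy['winning_age'].tolist())
```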

### Who were the oldest and youngest winners?

**Challenge**: 
* What are the names of the youngest and oldest Nobel laureates? 
* What did they win the prize for?
* What is the average age of a winner?
* 75% of laureates are younger than what age when they receive the prize?
* Use Seaborn to [create a histogram](https://seaborn.pydata.org/generated/seaborn.histplot.html) to visualise the distribution of laureate age at the time of winning. Experiment with the number of `bins` to see how the visualisation changes.

In [None]:
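# what are the names of the youngest and oldest Nobel laureate
extremes = df_data.loc[[df_data['winning_age'].idxmin(), df_data['winning_age'].idxmax()]]
extremes[['fullname', 'winning_age']]

In [None]:
# what did they win the prize for
extremes[['fullname', 'category', 'prize_name']]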

In [None]:
# what is the average age of a winner
df_data['winning_age'].mean()

In [None]:
# 75% of laureates are younger than what age when they receive the prize
df_data['winning_age'].quantile(0.75)

In [None]:
# use Seaborn to create histogram to visualise the distribution of laureate age at the time of winning
sns.histplot(df_data['winning_age'], bins=10)

### Descriptive Statistics for the Laureate Age at Time of Award

* Calculate the descriptive statistics for the age at the time of the award. 
* Then visualise the distribution in the form of a histogram using [Seaborn's `.histplot()` function](https://seaborn.pydata.org/generated/seaborn.histplot.html).
* Experiment with the `bin` size. Try 10, 20, 30, and 50.  

In [None]:
# calculate the descriptive statistics for the age at the time of the award
df_data['winning_age'].describe()

In [None]:
# visualise the distribution in the form of a histogram using Seaborn's .histplot() function
sns.histplot(df_data['winning_age'], bins=10)

### Age at Time of Award throughout History

Are Nobel laureates being nominated later in life than before? Have the ages of laureates at the time of the award increased or decreased over time?

**Challenge**

* Use Seaborn to [create a `.regplot()`](https://seaborn.pydata.org/generated/seaborn.regplot.html?highlight=regplot#seaborn.regplot) with a trend-line.
* Set the `lowess` parameter to `True` to show a moving average of the linear fit.
* According to the best fit line, how old were Nobel laureates in the years 1900-1940 when they were awarded the prize?
* According to the best fit line, what age would it predict for a Nobel laureate in 2020?


In [None]:
# use Seaborn to create a .regplot() with a trend-line
# pass the numeric year, since the regression needs numbers rather than datetimes
sns.regplot(x=df_data['year'].dt.year, y=df_data['winning_age'], lowess=True)
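
Reading a predicted age off the chart is approximate. As a rough cross-check, a straight line can be fitted and evaluated directly with NumPy. Note that this is an ordinary least-squares line, not the lowess curve that `regplot(lowess=True)` draws, and the numbers below are made up to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical (year, age) pairs lying exactly on a line with slope 0.125
years = np.array([1900, 1940, 1980, 2020])
ages = np.array([50.0, 55.0, 60.0, 65.0])

# Fit age = slope * year + intercept by least squares
slope, intercept = np.polyfit(years, ages, deg=1)

# Evaluate the fitted line at a chosen year
predicted_2020 = slope * 2020 + intercept
print(round(slope, 3), round(predicted_2020, 1))
```

With the real data, `years` would be `df_data['year'].dt.year` and `ages` would be `df_data['winning_age']`, after dropping rows where the age is missing.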

### Winning Age Across the Nobel Prize Categories

How does the age of laureates vary by category? 

* Use Seaborn's [`.boxplot()`](https://seaborn.pydata.org/generated/seaborn.boxplot.html?highlight=boxplot#seaborn.boxplot) to show how the median, quartiles, max, and minimum values vary across categories. Which category has the longest "whiskers"? 
* In which prize category are the average winners the oldest?
* In which prize category are the average winners the youngest?

In [None]:
# use Seaborn's .boxplot() to show how the median, quartiles, max, and minimum values vary across categories
sns.boxplot(x=df_data['category'], y=df_data['winning_age'])
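
A box plot shows the median rather than the mean, so the questions about the average winner are easier to answer numerically. A minimal sketch with placeholder categories and made-up ages:

```python
import pandas as pd

# Hypothetical ages; category labels are placeholders
toy = pd.DataFrame({
    'category': ['Category X', 'Category X', 'Category Y',
                 'Category Y', 'Category Z', 'Category Z'],
    'winning_age': [40, 50, 60, 70, 65, 75],
})

# Mean winning age per category
avg_age = toy.groupby('category')['winning_age'].mean()

oldest_category = avg_age.idxmax()     # category with the highest mean age
youngest_category = avg_age.idxmin()   # category with the lowest mean age
print(avg_age)
print(oldest_category, youngest_category)
```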

In [None]:
# which category has the longest "whiskers"

In [None]:
# in which prize category are the average winners the oldest

In [None]:
# in which prize category are the average winners the youngest

**Challenge**
* Now use Seaborn's [`.lmplot()`](https://seaborn.pydata.org/generated/seaborn.lmplot.html?highlight=lmplot#seaborn.lmplot) and the `row` parameter to create 6 separate charts for each prize category. Again set `lowess` to `True`.
* What are the winning age trends in each category? 
* Which category has the age trending up and which category has the age trending down? 
* Is this `.lmplot()` telling a different story from the `.boxplot()`?
* Create another chart with Seaborn. This time use `.lmplot()` to put all 6 categories on the same chart using the `hue` parameter. 


In [None]:
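# use Seaborn's .lmplot() and the row parameter to create 6 separate charts for each prize category
# the year is converted to a number first, since the regression needs numbers rather than datetimes
sns.lmplot(x='year', y='winning_age', data=df_data.assign(year=df_data['year'].dt.year), row='category', lowess=True)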

In [None]:
# what are the winning age trends in each category

In [None]:
# which category has the age trending up and which category has the age trending down

In [None]:
# is this .lmplot() telling a different story from the .boxplot()