### Import Statements

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sb
import matplotlib.pyplot as plt

### Notebook Presentation

In [None]:
pd.options.display.float_format = '{:,.2f}'.format

### Read the Data

In [None]:
df_data = pd.read_csv('nobel_prize_data.csv')

Caveats: The exact birth dates for Michael Houghton, Venkatraman Ramakrishnan, and Nadia Murad are unknown. I've substituted a mid-year estimate of July 2nd for each.


# Data Exploration & Cleaning

**Challenge**: Preliminary data exploration. 
* What is the shape of `df_data`? How many rows and columns?
    - 962, 16
* What are the column names?
* In which year was the Nobel prize first awarded? 
    - 1901
* Which year is the latest year included in the dataset?
    - 2020

**Challenge**: 
* Are there any duplicate values in the dataset?
    - no
* Are there NaN values in the dataset?
    - yes
* Which columns tend to have NaN values?
    - `organization_name`, `organization_city`, `organization_country`, and `motivation`
* How many NaN values are there per column? 
* Why do these columns have NaN values?  

### Check for Duplicates and NaN Values

In [None]:
print(f'Duplicate rows: {df_data.duplicated().sum()}')
df_data.isna().sum()
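
To see what the duplicate and NaN checks report, here is a minimal sketch on a made-up frame. The column names mirror the dataset, but the rows are hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names mirror the dataset, values are invented.
toy = pd.DataFrame({
    'full_name': ['A', 'B', 'C'],
    'organization_name': ['Univ X', np.nan, np.nan],
    'motivation': ['m1', 'm2', np.nan],
})

has_duplicates = toy.duplicated().any()  # True if any row exactly repeats an earlier one
nan_per_column = toy.isna().sum()        # number of NaN values in each column

print(has_duplicates)
print(nan_per_column[nan_per_column > 0])  # only the columns that contain NaN
```

Filtering with `nan_per_column > 0` is a quick way to list only the columns that need a closer look.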

### Type Conversions

**Challenge**: 
* Convert the `birth_date` column to Pandas `Datetime` objects
* Add a Column called `share_pct` which has the laureates' share as a percentage in the form of a floating-point number.

#### Convert Birth Date to Datetime

In [None]:
df_data['birth_date'] = pd.to_datetime(df_data['birth_date'])

#### Add a Column with the Prize Share as a Percentage

In [None]:
separated_data = df_data['prize_share'].astype(str).str.split('/', expand=True)
numerator = separated_data[0].astype(np.float64)
denominator = separated_data[1].astype(np.float64)
df_data['share_pct'] = numerator / denominator
df_data.info()
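
Note that dividing numerator by denominator gives a fraction between 0 and 1. If a value on the 0 to 100 scale is preferred, multiply by 100. A small sketch with hypothetical share strings in the same `numerator/denominator` format:

```python
import pandas as pd

# Hypothetical sample values in the 'numerator/denominator' format.
shares = pd.Series(['1/1', '1/2', '1/4'])

parts = shares.str.split('/', expand=True)                   # column 0: numerator, column 1: denominator
fraction = parts[0].astype(float) / parts[1].astype(float)  # share as a fraction of 1
percent = fraction * 100                                     # share on a 0-100 scale

print(fraction.tolist())  # [1.0, 0.5, 0.25]
print(percent.tolist())   # [100.0, 50.0, 25.0]
```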

# Plotly Donut Chart: Percentage of Male vs. Female Laureates

**Challenge**: Create a [donut chart using plotly](https://plotly.com/python/pie-charts/) which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?

In [None]:
gender_data = df_data['sex'].value_counts()

fig = px.pie(
    names=gender_data.index,    # slice labels
    values=gender_data.values,  # slice sizes
    title='Male vs Female winners',
    hole=0.4
)

fig.update_traces(
    textposition='inside',
    textfont_size=15,
    textinfo='percent'
)

fig.show()

# Who were the first 3 Women to Win the Nobel Prize?

**Challenge**: 
* What are the names of the first 3 female Nobel laureates? 
* What did they win the prize for? 
* What do you see in their `birth_country`? Were they part of an organisation?

In [None]:
df_data.head(1)

In [None]:
# Filter first, then sort, so the boolean mask lines up with the frame it indexes
df_data[df_data['sex'] == 'Female'].sort_values('year').head(3)

# Find the Repeat Winners

**Challenge**: Did some people get a Nobel Prize more than once? If so, who were they? 

In [None]:
won_twice = df_data.duplicated(subset=['full_name'], keep=False)
multiple_winners = df_data[won_twice]
multiple_winners[['year', 'category', 'laureate_type', 'full_name']].sort_values('full_name')
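
The `keep=False` argument is what makes this work: by default `duplicated()` leaves the first occurrence unmarked, so the first prize of each repeat winner would be dropped. A quick sketch on hypothetical names shows the difference:

```python
import pandas as pd

# Hypothetical laureates: 'Laureate A' appears twice.
toy = pd.DataFrame({
    'year': [1903, 1911, 1921],
    'full_name': ['Laureate A', 'Laureate A', 'Laureate B'],
})

first_only = toy.duplicated(subset=['full_name'])               # default keep='first'
all_repeats = toy.duplicated(subset=['full_name'], keep=False)  # marks every occurrence

print(first_only.tolist())   # [False, True, False]
print(all_repeats.tolist())  # [True, True, False]
```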

# Number of Prizes per Category

**Challenge**: 
* In how many categories are prizes awarded? 
* Create a plotly bar chart with the number of prizes awarded by category. 
* Use the color scale called `Aggrnyl` to colour the chart, but don't show a color axis.
* Which category has the most prizes awarded? 
* Which category has the fewest prizes awarded? 

In [None]:
df_data['category'].nunique()

In [None]:
winners_per_category = df_data['category'].value_counts()

v_bar = px.bar(
    x=winners_per_category.index,
    y=winners_per_category.values,
    color=winners_per_category.values,
    color_continuous_scale='Aggrnyl',
    title='Prizes Awarded Per Category'
)

v_bar.update_layout(
    xaxis_title='Nobel Prize Category',
    yaxis_title='Number of Prizes',
    coloraxis_showscale=False,  # hide the color axis, as the challenge asks
)

v_bar.show()

**Challenge**: 
* When was the first prize in the field of Economics awarded?
* Who did the prize go to?

In [None]:
df_data[df_data['category'] == 'Economics'].sort_values('year').head(1)

# Male and Female Winners by Category

**Challenge**: Create a [plotly bar chart](https://plotly.com/python/bar-charts/) that shows the split between men and women by category. 
* Hover over the bar chart. How many prizes went to women in Literature compared to Physics?


In [None]:
cat_men_women = df_data.groupby(['category', 'sex'], as_index=False).agg({'prize': pd.Series.count})

v_bar_split = px.bar(
    x=cat_men_women['category'],
    y=cat_men_women['prize'],
    color=cat_men_women['sex'],
    title='Number of Prizes Awarded per Category split By Sex'
)

v_bar_split.update_layout(
    xaxis_title='Nobel Prize Category',
    yaxis_title='Number of Prizes awarded'
)

v_bar_split.show()


# Number of Prizes Awarded Over Time

**Challenge**: Are more prizes awarded recently than when the prize was first created? Show the trend in awards visually. 
* Count the number of prizes awarded every year. 
* Create a 5 year rolling average of the number of prizes (Hint: see previous lessons analysing Google Trends).
* Using Matplotlib superimpose the rolling average on a scatter plot.
* Show a tick mark on the x-axis for every 5 years from 1900 to 2020. (Hint: you'll need to use NumPy). 



* Looking at the chart, did the first and second world wars have an impact on the number of prizes being given out? 
* What could be the reason for the trend in the chart?


In [None]:
prize_per_year = df_data.groupby('year')['prize'].count()
moving_avg = prize_per_year.rolling(window=5).mean()

yearly_avg_share = df_data.groupby('year').agg({'share_pct': pd.Series.mean})
share_moving_avg = yearly_avg_share.rolling(window=5).mean()
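
As a reminder of how `rolling(window=5).mean()` behaves, here is a sketch on a short hypothetical series. The first four values come out as NaN because a full window of five observations is not yet available.

```python
import pandas as pd

# Hypothetical yearly counts, indexed by year.
counts = pd.Series([1, 2, 3, 4, 5, 6], index=range(1901, 1907))

rolling = counts.rolling(window=5).mean()  # mean of the current and previous 4 values

print(rolling.tolist())  # [nan, nan, nan, nan, 3.0, 4.0]
```

This is why the trend line on the chart starts a few years after the scatter points do.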

In [None]:
fig = plt.figure(figsize=(16, 8), dpi=200)
plt.title('Number of Nobel Prizes Awarded over Time', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(
    ticks=np.arange(1900, 2021, step=5),
    fontsize=14,
    rotation=24
)

ax1 = plt.gca() # get current axis
ax2 = ax1.twinx() # share same x-axis
ax1.set_xlim(1900, 2020)

ax1.scatter(
    x=prize_per_year.index,
    y=prize_per_year.values,
    c='dodgerblue',
    alpha=0.7,
    s=100,
)

ax1.plot(
    prize_per_year.index,
    moving_avg.values,
    c='crimson',
    linewidth=3
)

ax2.invert_yaxis()
ax2.plot(
    prize_per_year.index,
    share_moving_avg,
    c='grey',
    linewidth=3
)
plt.show()

# The Countries with the Most Nobel Prizes

**Challenge**: 
* Create a Pandas DataFrame called `top20_countries` that has two columns. The `prize` column should contain the total number of prizes won. 

* Is it best to use `birth_country`, `birth_country_current` or `organization_country`? 
* What are some potential problems when using `birth_country` or any of the others? Which column is the least problematic? 
* Then use plotly to create a horizontal bar chart showing the number of prizes won by each country.

* What is the ranking for the top 20 countries in terms of the number of prizes?

In [None]:
top_countries = df_data.groupby('birth_country_current', as_index=False).agg({'prize': pd.Series.count})
top20_countries = top_countries.sort_values('prize')[-20:]  # last 20 rows = 20 largest counts

In [None]:
h_bar = px.bar(
    x=top20_countries['prize'],
    y=top20_countries['birth_country_current'],
    orientation='h',
    color=top20_countries['prize'],
    color_continuous_scale='Viridis',
    title='Top 20 Countries by Number of Prizes'
)

h_bar.update_layout(xaxis_title='Number of Prizes', 
                    yaxis_title='Country',
                    coloraxis_showscale=False)

h_bar.show()

# Use a Choropleth Map to Show the Number of Prizes Won by Country

* Create a choropleth map using [the plotly documentation](https://plotly.com/python/choropleth-maps/).

* Experiment with [plotly's available colours](https://plotly.com/python/builtin-colorscales/). I quite like the sequential colour `matter` on this map. 

Hint: You'll need to use a 3 letter country code for each country. 


In [None]:
df_countries = df_data.groupby(['birth_country_current', 'ISO'], as_index=False).agg({'prize': pd.Series.count})

In [None]:
world_map = px.choropleth(
    data_frame=df_countries,
    locations='ISO',
    color='prize',
    hover_name='birth_country_current',
    color_continuous_scale=px.colors.sequential.matter
)

world_map.update_layout(coloraxis_showscale=True)

world_map.show()


# In Which Categories are the Different Countries Winning Prizes? 

**Challenge**: See if you can divide up the plotly bar chart you created above to show which categories made up the total number of prizes.

In [None]:
cat_country = df_data.groupby(['birth_country_current', 'category'], 
                                as_index=False).agg({'prize': pd.Series.count})
cat_country.sort_values(by='prize', ascending=False, inplace=True)

merge_df = pd.merge(cat_country, top20_countries, on='birth_country_current')
merge_df.columns = ['birth_country_current', 'category', 'cat_prize', 'total_prize']
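
The merge does two things at once: it keeps only the countries present in both frames (an inner join by default), and it renames the clashing `prize` columns with `_x` and `_y` suffixes, which is why the columns are relabelled afterwards. A sketch with hypothetical frames and a shortened `country` column name:

```python
import pandas as pd

# Hypothetical per-category counts for three countries.
per_category = pd.DataFrame({
    'country': ['X', 'X', 'Y', 'Z'],
    'category': ['Physics', 'Peace', 'Physics', 'Peace'],
    'prize': [3, 1, 2, 1],
})
# Hypothetical totals for the "top" countries only (Z is not among them).
totals_top = pd.DataFrame({'country': ['X', 'Y'], 'prize': [4, 2]})

merged = pd.merge(per_category, totals_top, on='country')  # inner join on 'country'

print(merged.columns.tolist())  # ['country', 'category', 'prize_x', 'prize_y']
print(len(merged))              # 3
```

Country `Z` disappears from the result because it has no match in the totals frame.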

In [None]:
merge_df.head(1)

In [None]:
cat_cntry_bar = px.bar(
    data_frame=merge_df,  # plot the merged frame, which only holds the top countries
    x='cat_prize',
    y='birth_country_current',
    color='category',
    orientation='h',
    title='Top 20 Countries by Prizes & Category'
)

cat_cntry_bar.update_layout(xaxis_title='Number of Prizes', 
                            yaxis_title='Country')

cat_cntry_bar.show()

### Number of Prizes Won by Each Country Over Time

* When did the United States eclipse every other country in terms of the number of prizes won? 
* Which country or countries were leading previously?
* Calculate the cumulative number of prizes won by each country in every year. Again, use the `birth_country_current` of the winner to calculate this. 
* Create a [plotly line chart](https://plotly.com/python/line-charts/) where each country is a coloured line. 

In [None]:
prize_by_year = df_data.groupby(by=['birth_country_current', 'year'], as_index=False).count()

prize_by_year = prize_by_year.sort_values('year')[['year', 'birth_country_current', 'prize']]
prize_by_year.head()

In [None]:
cumulative_prizes = prize_by_year.groupby(by=['birth_country_current', 'year']).sum().groupby(level=[0]).cumsum()
cumulative_prizes.reset_index(inplace=True) 
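
The running total per country can also be written as a grouped `cumsum()` directly on a column. This is a sketch of that alternative on hypothetical data, not the approach used in the cell above, with a shortened `country` column name:

```python
import pandas as pd

# Hypothetical prizes per country and year.
toy = pd.DataFrame({
    'country': ['X', 'Y', 'X', 'X', 'Y'],
    'year': [1901, 1901, 1902, 1904, 1905],
    'prize': [1, 2, 1, 3, 1],
})

# Sort so each country's rows are in chronological order, then accumulate within each country.
toy = toy.sort_values(['country', 'year'])
toy['cumulative'] = toy.groupby('country')['prize'].cumsum()

print(toy['cumulative'].tolist())  # [1, 2, 5, 2, 3]
```

The first three values are the running total for `X`, the last two for `Y`.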


In [None]:
prize_chart = px.line(
    data_frame=cumulative_prizes,
    x='year',
    y='prize',
    color='birth_country_current',
    hover_name='birth_country_current'
)

prize_chart.update_layout(xaxis_title='Year',
                      yaxis_title='Number of Prizes')
 
prize_chart.show()

# What are the Top Research Organisations?

In [None]:
df_data.head(1)

In [None]:
top20_org = df_data.groupby(by=['organization_name'])['prize'].count().sort_values(ascending=False)[:20]

In [None]:
top20_org.index

In [None]:
h_bar = px.bar(
    x=top20_org.values,  # top20_org already holds exactly 20 organisations
    y=top20_org.index,
    orientation='h',
    color=top20_org.values,
    color_continuous_scale='Viridis',
    title='Top 20 Organizations by Number of Prizes'
)

h_bar.update_layout(xaxis_title='Number of Prizes', 
                    yaxis_title='Institution',
                    coloraxis_showscale=False)

h_bar.show()

# Which Cities Make the Most Discoveries? 

Where do major discoveries take place?  

**Challenge**: 
* Create another plotly bar chart graphing the top 20 organisation cities of the research institutions associated with a Nobel laureate. 
* Where is the number one hotspot for discoveries in the world?
* Which city in Europe has had the most discoveries?

In [None]:
top20_cities = df_data['organization_city'].value_counts()[:20]
top20_cities.sort_values(ascending=True, inplace=True)

city_bar = px.bar(
    x=top20_cities.values,
    y=top20_cities.index,
    orientation='h',
    color=top20_cities.values,
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Which Cities Do the Most Research'
)

city_bar.update_layout(
    xaxis_title='Number of Prizes',
    yaxis_title='City',
)

city_bar.show()

# Where are Nobel Laureates Born? Chart the Laureate Birth Cities 

**Challenge**: 
* Create a plotly bar chart graphing the top 20 birth cities of Nobel laureates. 
* Use a named colour scale called `Plasma` for the chart.
* What percentage of the United States prizes came from Nobel laureates born in New York? 
* How many Nobel laureates were born in London, Paris and Vienna? 
* Out of the top 5 cities, how many are in the United States?


In [None]:
top20_birth_cities = df_data['birth_city'].value_counts()[:20]
top20_birth_cities.sort_values(ascending=True, inplace=True)
# Plot the birth cities (not the organisation cities from the previous chart)
city_bar = px.bar(x=top20_birth_cities.values,
                  y=top20_birth_cities.index,
                  orientation='h',
                  color=top20_birth_cities.values,
                  color_continuous_scale=px.colors.sequential.Plasma,
                  title='Where were the Nobel Laureates Born?')
 
city_bar.update_layout(xaxis_title='Number of Prizes', 
                       yaxis_title='City of Birth',
                       coloraxis_showscale=False)
city_bar.show()

# Plotly Sunburst Chart: Combine Country, City, and Organisation

**Challenge**: 

* Create a DataFrame that groups the number of prizes by organisation. 
* Then use the [plotly documentation to create a sunburst chart](https://plotly.com/python/sunburst-charts/)
* Click around in your chart, what do you notice about Germany and France? 


In [None]:
# 1 country = multiple cities
# 1 city = multiple institutions

country_city_org = df_data.groupby(
    by=['organization_country', 'organization_city', 'organization_name'],
    as_index=False,
).agg({'prize': pd.Series.count})

country_city_org.sort_values('prize', ascending=False)

In [None]:
burst = px.sunburst(
    data_frame=country_city_org,
    path=['organization_country', 'organization_city', 'organization_name'],
    values='prize',
    title='Where do Discoveries Take Place'
)

burst.update_layout(
    xaxis_title='Number of Prizes',
    yaxis_title='City'
)

burst.show()

# Patterns in the Laureate Age at the Time of the Award

How Old Are the Laureates When They Win the Prize?

**Challenge**: Calculate the age of the laureate in the year of the ceremony and add this as a column called `winning_age` to the `df_data` DataFrame. Hint: you can use [this](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html) to help you. 



In [None]:
laureates_birthyear = pd.DatetimeIndex(df_data['birth_date']).year
df_data['winning_age'] = df_data['year'] - laureates_birthyear
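
The same calculation can be written with the `.dt` accessor that the hint links to. A minimal sketch on hypothetical rows, including one with a missing birth date (as would be the case for organisations), to show that the age then comes out as NaN:

```python
import pandas as pd

# Hypothetical rows: one person and one entry without a birth date.
toy = pd.DataFrame({
    'year': [1950, 2000],
    'birth_date': ['1900-07-02', None],
})

toy['birth_date'] = pd.to_datetime(toy['birth_date'])          # None becomes NaT
toy['winning_age'] = toy['year'] - toy['birth_date'].dt.year   # NaT propagates as NaN

print(toy['winning_age'].tolist())  # [50.0, nan]
```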


In [None]:
oldest_winner = df_data[df_data['winning_age'] == df_data['winning_age'].max()]
oldest_winner

In [None]:
df_data['winning_age'].describe()
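
The `75%` row in the `describe()` output is the third quartile: three quarters of the values fall at or below it. It matches `quantile(0.75)`, as this sketch on hypothetical ages shows:

```python
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([30, 40, 50, 60, 70])

summary = ages.describe()    # count, mean, std, min, quartiles, max
q75 = ages.quantile(0.75)    # the same value as the '75%' row

print(summary['75%'], q75)   # 60.0 60.0
print(summary['mean'])       # 50.0
```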

### Who were the oldest and youngest winners?

**Challenge**: 
* What are the names of the youngest and oldest Nobel laureates? 
* What did they win the prize for?
* What is the average age of a winner?
* 75% of laureates are younger than what age when they receive the prize?
* Use Seaborn to [create a histogram](https://seaborn.pydata.org/generated/seaborn.histplot.html) to visualise the distribution of laureate age at the time of winning. Experiment with the number of `bins` to see how the visualisation changes.

In [None]:
plt.figure(figsize=(8, 4), dpi=200)
sb.histplot(
    data=df_data,
    x='winning_age',
    bins=30
)

plt.xlabel('Age')
plt.title('Distribution of Age on Receipt of Prize')
plt.show()

### Age at Time of Award throughout History

Are Nobel laureates being nominated later in life than before? Have the ages of laureates at the time of the award increased or decreased over time?

**Challenge**

* Use Seaborn to [create a .regplot](https://seaborn.pydata.org/generated/seaborn.regplot.html?highlight=regplot#seaborn.regplot) with a trendline.
* Set the `lowess` parameter to `True` to show a moving average of the linear fit.
* According to the best fit line, how old were Nobel laureates in the years 1900-1940 when they were awarded the prize?
* According to the best fit line, what age would it predict for a Nobel laureate in 2020?


In [None]:
plt.figure(figsize=(8, 4), dpi=200)
with sb.axes_style('whitegrid'):
    sb.regplot(
        data=df_data,
        x='year',
        y='winning_age',
        lowess=True,  # locally weighted trendline, as the challenge asks (needs statsmodels)
        scatter_kws={'alpha': 0.4},
        line_kws={'color':'black'}
    )
    
    plt.show()

### Winning Age Across the Nobel Prize Categories

How does the age of laureates vary by category? 

* Use Seaborn's [`.boxplot()`](https://seaborn.pydata.org/generated/seaborn.boxplot.html?highlight=boxplot#seaborn.boxplot) to show how the median, quartiles, maximum, and minimum values vary across categories. Which category has the longest "whiskers"? 
* In which prize category are the average winners the oldest?
* In which prize category are the average winners the youngest?

In [None]:
plt.figure(figsize=(8, 4), dpi=200)
with sb.axes_style('whitegrid'):
    sb.boxplot(
        data=df_data,
        x='category',
        y='winning_age'
    )
    
    plt.show()

**Challenge**
* Now use Seaborn's [`.lmplot()`](https://seaborn.pydata.org/generated/seaborn.lmplot.html?highlight=lmplot#seaborn.lmplot) and the `row` parameter to create a separate chart for each of the 6 prize categories. Again set `lowess` to `True`.
* What are the winning age trends in each category? 
* Which category has the age trending up and which category has the age trending down? 
* Is this `.lmplot()` telling a different story from the `.boxplot()`?
* Create another chart with Seaborn. This time use `.lmplot()` to put all 6 categories on the same chart using the `hue` parameter. 


In [None]:
with sb.axes_style('whitegrid'):
    sb.lmplot(
        data=df_data,
        x='year',
        y='winning_age',
        row='category',
        lowess=True,
        aspect=2,
        scatter_kws={'alpha': 0.4},
        line_kws={'color': 'black'}
    )
    plt.show()

In [None]:
with sb.axes_style("whitegrid"):
    sb.lmplot(data=df_data,
                x='year',
                y='winning_age',
                hue='category',
                lowess=True, 
                aspect=2,
                scatter_kws={'alpha': 0.5},
                line_kws={'linewidth': 2})
    
plt.show()